Re: hostname in task

2019-08-03 Thread James Peach



> On Aug 3, 2019, at 10:59 PM, Marc Roos  wrote:
> 
> 
> I read you can add a hostname option to the container in this issue[0], 
> however I still have the uuid. Is this available in Mesos 1.8?

Yep.

> Can I 
> somewhere read all these options? Like here[1]

The Mesos API is defined in the ContainerInfo protobuf, but I’m not sure how 
Marathon maps that:

https://github.com/apache/mesos/blob/master/include/mesos/v1/mesos.proto#L3395
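
For illustration, a minimal sketch of where the field sits, assuming the v1
JSON encoding of ContainerInfo and its `hostname` field. The helper name and
the example values are hypothetical, and how Marathon maps its app definition
onto this message is not covered here.

```python
import json


def container_info(hostname, image):
    """Build a Mesos v1 ContainerInfo as a plain dict (JSON encoding).

    `hostname` is the ContainerInfo field discussed above; the image
    name is a placeholder, not a real image.
    """
    return {
        "type": "MESOS",
        "hostname": hostname,
        "mesos": {
            "image": {
                "type": "DOCKER",
                "docker": {"name": image},
            }
        },
    }


info = container_info("test.example.com", "test")
print(json.dumps(info, indent=2))
```

Note that the hostname sits directly on the container message, next to the
type, rather than inside the image settings.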


> 
> 
> [@ cni]# cat 2f261fa8-4985-4614-b712-f0785ca6ce04/hosts
> 127.0.0.1 localhost
> 192.168.123.32 2f261fa8-4985-4614-b712-f0785ca6ce04
> 
> [0]
> https://reviews.apache.org/r/55191/
> [1]
> http://mesosphere.github.io/marathon/api-console/index.html
> 
> Using mesos 1.8
> And
> 
> "container": {
>   "type": "MESOS",
>   "hostname": "test.example.com",
>   "docker": {
>     "image": "test",
>     "credential": null,
>     "forcePullImage": true
>   },
>   "volumes": [
>     {
>       "mode": "RW",
>       "containerPath": "/dev/log",
>       "hostPath": "/dev/log"
>     }
>   ]
> },



Re: On adding a debug endpoint for Mesos containerizer

2019-06-05 Thread James Peach
I really like this proposal and I think that it would help operational teams a 
lot. Let’s make sure that it is well documented :)

> On Jun 5, 2019, at 1:05 AM, Andrei Budnik  wrote:
> 
> Hi folks,
> 
> We have been encountering container stuck issues for quite a long time. Some 
> of these issues are caused by external components such as CNI/CSI plugins, 
> custom Mesos modules, etc. Also, there were cases when a container became 
> stuck due to a Linux kernel bug. All these kinds of issues make it difficult 
> to debug container stuck issues.
> 
> We are proposing a container debug endpoint for the Mesos agent [1], which is 
> based on a new mechanism for tracking pending libprocess futures [2].
> 
> Please review both of them.
> 
> [1] Container debug endpoint: 
> https://docs.google.com/document/d/1VtlKD6b8a22HzSdaJUeI7cPGuKd01vLwBJT4XfkeUDI
> [2] Tracking libprocess futures: 
> https://docs.google.com/document/d/1Unu2pe0dRq3Z6XQ5S8lWZm2cU2REjfkUj0xk2ePQ0MY



Re: ssl mesos-executor not using /etc/default/mesos

2019-02-18 Thread James Peach



> On Feb 16, 2019, at 9:46 AM, Marc Roos  wrote:
> 
> 
> 
> Looks like the mesos-executor is not using /etc/default/mesos 
> environment variables

Depending on your configuration, the executor runs inside the container, which 
means that /etc/default/mesos is probably not available. 

> 
> If I export the variables in /etc/default/mesos manually, I can run the 
> task. 
> 
> mesos-execute --master=x.x.x.x:5050 --principal=xxx --secret=xxx 
> --name=ls --command="ls -lRrt /*; sleep 60" --env=file:///test/env.json
> 
> How should this be resolved? I tried setting 
> --executor_environment_variables=/etc/mesos/executor-env.json 

And what happened when you did this?
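
One way to feed the same settings to the agent flag, as a rough sketch: it
assumes /etc/default/mesos holds simple KEY=VALUE lines (optionally prefixed
with `export`) and that --executor_environment_variables takes a JSON object
mapping names to values. The function name and the sample variables are
illustrative only.

```python
import json


def env_file_to_json(text):
    """Convert KEY=VALUE lines into a JSON object string."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Not an assignment; ignore it.
        # Drop one layer of surrounding quotes, if present.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        env[key.strip()] = value
    return json.dumps(env, sort_keys=True)


sample = """
# hypothetical contents of /etc/default/mesos
export LIBPROCESS_SSL_ENABLED=1
LIBPROCESS_SSL_CERT_FILE="/etc/mesos/cert.pem"
"""
print(env_file_to_json(sample))
```

This does no shell expansion, so values that reference other variables would
need to be resolved by hand.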

Re: How is running 1.7.0 in production?

2018-11-13 Thread James Peach


> On Nov 13, 2018, at 5:45 PM, Stuart Elston  wrote:
> 
> Hi everyone,
> 
> We are contemplating an upgrade to Mesos 1.7.0 but are generally a little 
> wary of running .0 releases.  Has anyone encountered any showstoppers while 
> running 1.7.0?  We'd be curious to hear your experiences!

I’ve been running something slightly pre- the 1.7.0 release tag in prod for a 
long time and it’s fine. I’m currently rolling out a post- 1.7.0 snapshot and 
that’s going well so far.

J

[ANNOUNCE] mesos_exporter 1.1.1 released

2018-10-25 Thread James Peach
Hi all,

Just a quick note to say that mesos_exporter 1.1.1 has been released. This is a 
bug fix release that fixes a regression I introduced in v1.1.0. Source code and 
binaries are available on GitHub.

https://github.com/mesos/mesos_exporter/releases/tag/v1.1.1

Thanks to Chase Sillevis who contributed the fix for this release.

cheers,
James

Re: Propose to run debug container as the same user of its parent container by default

2018-10-25 Thread James Peach


> On Oct 23, 2018, at 7:47 PM, Qian Zhang  wrote:
> 
> Hi all,
> 
> Currently when launching a debug container (e.g., via `dcos task exec` or 
> command health check) to debug a task, by default Mesos agent will use the 
> executor's user as the debug container's user. There are actually 2 cases:
> 1. Command task: Since the command executor's user is the same as the command 
> task's user, the debug container will be launched as the same user as the 
> command task.
> 2. The task in a task group: The default executor's user is the same as the 
> framework user, so in this case the debug container will be launched as the 
> framework user rather than the task's user.
> 
> Basically I think the behavior of case 1 is correct. For case 2, we may run 
> into a situation where the task is run as one user (e.g., root), but the debug 
> container used to debug that task is run as another user (e.g., a normal 
> user, supposing the framework is run as a normal user), which may not be what 
> the user expects.
> 
> So I created MESOS-9332  
> and propose to run debug container as the same user of its parent container 
> (i.e., the task to be debugged) by default. Please let me know if you have 
> any comments, thanks!

This sounds like a sensible default to me. I can imagine for debug use cases 
you might want to run the debug container as root or give it elevated 
capabilities, but that should not be the default.

J

Re: [VOTE] Release Apache Mesos 1.7.0 (rc3)

2018-09-14 Thread James Peach
+1 (binding)

make check on Fedora 28

> On Sep 11, 2018, at 11:09 AM, Gastón Kleiman  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
> 
> 
> 1.7.0 includes the following:
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF sorter
> 
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc3
>  
> 
> 
> 
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz 
> 
> 
> The tag to be voted on is 1.7.0-rc3:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc3 
> 
> 
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.sha512
>  
> 
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.asc 
> 
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS 
> 
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1234 
> 
> 
> Please vote on releasing this package as Apache Mesos 1.7.0!
> 
> The vote is open until Fri Sep 14 11:06:30 PDT 2018 and passes if a majority 
> of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> 
> Chun-Hung & Gastón
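
For anyone repeating the verification, a minimal sketch of checking the tarball
against its published SHA512 digest. It assumes the expected digest has already
been read from the .sha512 file as a hex string; the function names are
illustrative and this is not the project's release tooling. The demo at the end
runs on a throwaway file rather than a real release tarball.

```python
import hashlib
import os
import tempfile


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare a file's digest with an expected hex string."""
    return sha512_of(path) == expected_hex.strip().lower()


# Demo: write a small file, then check a correct and an incorrect digest.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
expected = hashlib.sha512(b"example payload").hexdigest()
ok = checksum_matches(tmp.name, expected)
bad = checksum_matches(tmp.name, "0" * 128)
os.remove(tmp.name)
print(ok, bad)
```

The digest only guards against corruption in transit; the .asc signature and
the KEYS file are what tie the tarball to the release manager.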



Re: Libevent bundling ahead.

2018-09-12 Thread James Peach



> On Sep 11, 2018, at 6:14 PM, Till Toenshoff  wrote:
> 
> Hey All,
> 
> We are considering bundling/vendoring libevent 2.0.22 with upcoming releases 
> of Mesos.
> 
> Let me explain the motivation and then go into some details.
> 
> Due to https://issues.apache.org/jira/browse/MESOS-7076, SSL builds of Mesos 
> stopped functioning on distributions that offer libevent 2.1.8 by default. 
> Specifically the failure was observed on Ubuntu 17/18 as well as on macOS. It 
> has also just come to my attention that Fedora 18 shares the same fate.

Fedora 28 (F28).

> So the problem is less likely OS specific but more likely libevent + SSL + 
> libprocess specific.
> Instead of getting stuck in the rabbit hole of debugging right away, I 
> decided that bundling a known good version of libevent was the most reliable 
> way to prevent sad faces when building Mesos with SSL. That way we can be 
> sure SSL builds of Mesos function properly across all supported platforms, 
> out of the box.
> 
> Details on the bundling:
> We will include libevent 2.0.22 and we also include a patch that makes that 
> version build against both openssl 1.0.x and 1.1.x. For unbundled 
> builds (--with-libevent) I plan some additional checks that try to 
> prevent a build of a known bad variant of libevent + SSL + Mesos.
> 
> The bundling and those checks are a workaround, not a solution. I am still 
> debugging the underlying cause. However, way too much time has 
> passed already without a proper solution, hence this suggestion of bundling 
> as a quick-fix workaround.
> 
> Let me know your thoughts!

I think this is OK as long as we have a reasonable expectation that we can 
unbundle soon-ish.

J

Re: make check failed, but mesos-tests.sh --gtest_filter="SVNTest.DiffPatch" tests passed

2018-09-04 Thread James Peach
This might be caused by inconsistent linking in Homebrew. Try forcing Homebrew 
to build svn from source, something like this:

  brew install --force --build-from-source subversion


> On Sep 4, 2018, at 2:29 AM, Chang Shawn  wrote:
> 
> After running 'make' successfully on my macOS 10.13.6, I ran 'make check', but 
> it failed on test case "SVNTest.DiffPatch". The error output is:
> 
> [--] 2 tests from SVNTest
> 
> [ RUN  ] SVNTest.DiffPatch
> 
> *** Aborted at 1536051660 (unix time) try "date -d @1536051660" if you are 
> using GNU date ***
> 
> PC: @0x1094239d6 apr_pool_create_ex
> 
> *** SIGSEGV (@0x30) received by PID 84174 (TID 0x7fff8a2b6380) stack trace: 
> ***
> 
> @ 0x7fff51ab0f5a _sigtramp
> 
> @0x0 (unknown)
> 
> @0x10922380e svn_pool_create_ex
> 
> @0x107e13f4e svn::diff()
> 
> @0x107e133eb SVNTest_DiffPatch_Test::TestBody()
> 
> @0x107fbbebe 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> 
> @0x107f5c01b 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> 
> @0x107f5bf46 testing::Test::Run()
> 
> @0x107f5dd5d testing::TestInfo::Run()
> 
> @0x107f5f38c testing::TestCase::Run()
> 
> @0x107f6fbac testing::internal::UnitTestImpl::RunAllTests()
> 
> @0x107fbf14e 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> 
> @0x107f6f5db 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> 
> @0x107f6f49c testing::UnitTest::Run()
> 
> @0x107b51ab1 RUN_ALL_TESTS()
> 
> @0x107b51825 main
> 
> @ 0x7fff517a2015 start
> 
> make[6]: *** [check-local] Segmentation fault: 11
> 
> make[5]: *** [check-am] Error 2
> 
> make[4]: *** [check-recursive] Error 1
> 
> make[3]: *** [check] Error 2
> 
> make[2]: *** [check-recursive] Error 1
> 
> make[1]: *** [check] Error 2
> 
> make: *** [check-recursive] Error 1
> 
> So I ran ./bin/mesos-tests.sh --gtest_filter="SVNTest.DiffPatch" to try to 
> get more information, but it seems that the tests passed:

The SVN tests are part of stout (but are run during make check):

 ./3rdparty/stout/stout-tests --gtest_list_tests --gtest_filter="SVNTest.DiffPatch"
SVNTest.
  DiffPatch

J

Re: [VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-29 Thread James Peach
+1 (binding)

Built and tested on Fedora 28 (clang).

> On Aug 24, 2018, at 4:42 PM, Chun-Hung Hsiao  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
> 
> 
> 1.7.0 includes the following:
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF sorter
> 
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc2
>  
> 
> 
> 
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz 
> 
> 
> The tag to be voted on is 1.7.0-rc2:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc2 
> 
> 
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.sha512
>  
> 
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.asc 
> 
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS 
> 
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1233 
> 
> 
> Please vote on releasing this package as Apache Mesos 1.7.0!
> 
> The vote is open until Mon Aug 27 16:37:35 PDT 2018 and passes if a majority 
> of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> Chun-Hung & Gaston



[ANNOUNCE] mesos_exporter 1.1.0 released

2018-08-23 Thread James Peach
Hi all,

I'm pleased to announce that mesos_exporter 1.1.0 has been released. This is a 
minor release with a collection of bug fixes and minor features.

https://github.com/mesos/mesos_exporter/releases/tag/v1.1.0

Many thanks to Mesosphere, who kindly contributed the code to the Mesos 
community, and to the following contributors: Alan Bover, Eric Lubow, Hector 
Fernandez, Jack Thomasson, James Peach, Jonathan Sokolowski, Philip Norman, 
Stephan Erb, Trevor Wood and Vinod Kone.

cheers,
James

Re: Volume ownership and permission

2018-08-16 Thread James Peach



> On Aug 15, 2018, at 6:22 PM, Qian Zhang  wrote:
> 
> Hi Folks,
> 
> We found some issues with the solutions for this project and propose a better
> one, see here
> <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.tjuy5xk67tuu>
> for details. Please let me know if you have any comments, thanks!

Some general comments.

I assume that this scheme will only be supported on Linux, due to the 
dependencies on the Linux ACLs and supplementary group behaviour?  

Rewriting ACLs on volumes at each container launch sounds hugely expensive. 
It's an IOP-bound process and there is an effectively unbounded number of files 
in the volume. Would this serialize container cleanup?

It seems like ACL evaluation will mean that this scheme will only mostly work. 
For example, if the container process UID matches a user ACE, then access could 
be denied independently of the volume policy.
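
The point about a matching user ACE can be made concrete with a simplified
sketch of the POSIX ACL access check order (owner, named user, groups, other).
The dict layout, function name, and the UIDs/GIDs are hypothetical; this
illustrates the evaluation order, not the VolumeAclManager design.

```python
def acl_allows(uid, gids, owner_uid, owner_gid, acl, want):
    """Return True if every permission in `want` (e.g. "w") is granted.

    `acl` has "user_obj", "group_obj", "other" and optional "mask"
    (permission strings such as "r-x"), plus optional "users" and
    "groups" dicts mapping an id to a permission string.
    """
    def has(perms):
        return set(want) <= set(perms)

    mask = acl.get("mask")

    def masked(perms):
        # Named-user and group entries are limited by the mask entry.
        return has(perms) and (mask is None or has(mask))

    if uid == owner_uid:
        return has(acl["user_obj"])
    if uid in acl.get("users", {}):
        # A matching named-user entry is final; groups are not consulted.
        return masked(acl["users"][uid])
    group_perms = []
    if owner_gid in gids:
        group_perms.append(acl["group_obj"])
    group_perms += [p for g, p in acl.get("groups", {}).items() if g in gids]
    if group_perms:
        return any(masked(p) for p in group_perms)
    return has(acl["other"])


acl = {
    "user_obj": "rwx", "group_obj": "r-x", "other": "---", "mask": "rwx",
    "users": {1000: "r-x"},    # pre-existing ACE for UID 1000
    "groups": {5000: "rwx"},   # ACE granting the container GID write access
}
denied = acl_allows(1000, {5000}, 0, 0, acl, "w")
granted = acl_allows(1001, {5000}, 0, 0, acl, "w")
print(denied, granted)
```

Here UID 1000 is refused write access even though it is a member of GID 5000,
because the named-user entry is matched before any group entry.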

Will the VolumeAclManager apply a default ACL on the root of the volume? Does 
this imply that when it updates the ACEs for the container GID, it also needs 
to update the default ACLs on all directories?

> 
> 
> Regards,
> Qian Zhang
> 
> On Sat, Apr 28, 2018 at 7:57 AM, Qian Zhang  wrote:
> 
>>> The framework launched tasks in a group with different users? Sounds
>>> like they dug their own hole :)
>> 
>> So you mean we should actually put a best practice or limitation in the doc:
>> when launching a task group with multiple tasks that share a SANDBOX volume
>> of PARENT type, all the tasks should be run as the same user, and that
>> user must be the same as the user that launches the executor? Otherwise the
>> task will not be able to write to the volume.
>> 
>>> I'd argue that the "rw" on the sandbox path is analogous to the "rw"
>>> mount option. That is, it is mounted writeable, but says nothing about
>>> which credentials can write to it.
>> 
>> Can you please elaborate a bit on this? What would you suggest for the
>> "rw" volume mode?
>> 
>> 
>> Regards,
>> Qian Zhang
>> 
>> On Fri, Apr 27, 2018 at 12:07 PM, James Peach  wrote:
>> 
>>> 
>>> 
>>>> On Apr 26, 2018, at 7:25 PM, Qian Zhang  wrote:
>>>> 
>>>> Hi James,
>>>> 
>>>> Thanks for your comment!
>>>> 
>>>> I think you are talking about the SANDBOX_PATH volume ownership issue
>>>> mentioned in the design doc
>>>> <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p>,
>>>> IIUC, you prefer to leaving it to framework, i.e., framework itself ought
>>>> to be able to handle such issue. But I am curious how framework can handle
>>>> it in such situation. If the framework launches a task group with different
>>>> users and with a SANDBOX_PATH volume of PARENT type, the tasks in the group
>>>> will definitely fail to write to the volume due to the ownership issue
>>>> though the volume's mode is set to "rw". So in this case, how should
>>>> framework handle it?
>>> 
>>> The framework launched tasks in a group with different users? Sounds like
>>> they dug their own hole :)
>>> 
>>> I'd argue that the "rw" on the sandbox path is analogous to the "rw"
>>> mount option. That is, it is mounted writeable, but says nothing about
>>> which credentials can write to it.
>>> 
>>>> And if we want to document it, what is our recommended
>>>> solution in the doc?
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Qian Zhang
>>>> 
>>>> On Fri, Apr 27, 2018 at 1:16 AM, James Peach  wrote:
>>>> 
>>>>> I commented on the doc, but at least some of the issues raised there I
>>>>> would not regard as issues. Rather, they are about setting expectations
>>>>> correctly and ensuring that we are documenting (and maybe enforcing)
>>>>> sensible behavior.
>>>>> 
>>>>> I'm not that keen on Mesos automatically "fixing" filesystem permissions
>>>>> and we should proceed down that path with caution, especially in the ACLs
>>>>> case.
>>>>> 
>>>>>> On Apr 10, 2018, at 3:15 AM, Qian Zhang  wrote:
>>>>>> 
>>>>>> Hi Folks,
>>>>>> 
>>>>>> I am working on MESOS-8767 to improve Mesos volume support regarding
>>>>>> volume ownership and permission, here is the design doc. Please feel free
>>>>>> to let me know if you have any comments/feedbacks, you can reply this mail
>>>>>> or comment on the design doc directly. Thanks!
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Qian Zhang
>>>>> 
>>>>> 
>>> 
>>> 
>> 



Re: [VOTE] Move the project repos to gitbox

2018-07-17 Thread James Peach



> On Jul 17, 2018, at 7:58 AM, Vinod Kone  wrote:
> 
> Hi,
> 
> As discussed in another thread and in the committers sync, there seems to be 
> strong interest in moving our project repos ("mesos", "mesos-site") from the 
> "git-wip" git server to the new "gitbox" server to make better use of GitHub 
> integrations.
> 
> Please vote +1, 0, -1 regarding the move to gitbox. The vote will close in 3 
> business days.


+1

implicit mesos-local support in scheduler drivers

2018-07-03 Thread James Peach
Hi all,

I found recently that the Mesos scheduler drivers will implicitly spin up a 
`mesos-local` cluster for testing if your scheduler uses the Mesos scheduler 
drivers, specifies “local” as the master, and exports “MESOS_” environment 
variables to configure the master. Do any scheduler authors use this? If so, 
can you describe the workflow?

thanks,
James

Re: narrowing task sandbox permissions

2018-06-15 Thread James Peach



> On Jun 15, 2018, at 11:06 AM, Zhitao Li  wrote:
> 
> Sorry for getting back to this really late, but we got bit by this behavior
> change in our environment.
> 
> The broken scenario we had:
> 
>   1. We are using Aurora to launch docker containerizer based tasks on
>   Mesos;
>   2. Most of our docker containers had some legacy behavior: *the
>   execution entered as "root" in the entry point script,* setup a couple
>   of symlinks and other preparation work, then *"de-escalate" into a non
>   privileged user (i.e, "user")*;
>  1. This was added so that the entry point script has enough
>  permission to reconfigure certain side car processes (i.e, nginx) and
>  filesystem paths;
>   3. unfortunately, the "user" user will lose access to the sandbox after
>   this change.
> 
> 
> While I'd acknowledge that above behavior is legacy and a piece of major
> tech debt, cleaning it up for the thousands of applications on our platform
> was never easy. Given that our org has other useful features available in
> 1.6, I would propose a couple of options:
> 
>   1. making the sandbox permission bits configurable
>  1. Certain framework knows that their tasks do not leave sensitive
>  data on sandbox so we could provide this flexibility (it's very useful in
>  practice for migration to a container based system);
>  2. Alternatively, making this possible to reconfigure on agent flags:
>  This could be more secure and easier to manage, but lacks flexibility of
>  allowing different frameworks to do different things.
>   2. Until the customization is in place, consider a revert of the
>   permission bit change so we preserve the original behavior.

That's a pretty unfortunate outcome. Can you change the permissions in your 
script, or apply a Mesos patch until the legacy behavior can be addressed?
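
As a rough sketch of the first option: the entry point could widen the sandbox
mode while it is still running as root, before it drops to the unprivileged
user. This assumes the sandbox is a directory whose mode no longer grants
access to other users; the function name is hypothetical, and the demo runs on
a temporary directory rather than a real sandbox.

```python
import os
import stat
import tempfile


def open_up_sandbox(sandbox, extra=stat.S_IROTH | stat.S_IXOTH):
    """Add permission bits to the sandbox directory; return the new mode."""
    mode = stat.S_IMODE(os.stat(sandbox).st_mode)
    os.chmod(sandbox, mode | extra)
    return stat.S_IMODE(os.stat(sandbox).st_mode)


# In a real entry point the path would come from the environment, e.g.
# os.environ["MESOS_SANDBOX"]. Here a temporary directory stands in for it.
demo = tempfile.mkdtemp()
os.chmod(demo, 0o750)
new_mode = open_up_sandbox(demo)
os.rmdir(demo)
print(oct(new_mode))
```

Widening the mode gives back some of what the narrower permissions were meant
to protect, so changing the directory's owner to the de-escalated user may be
the safer variant of the same idea.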

J

Re: Deprecating the Python bindings

2018-06-06 Thread James Peach


> On May 9, 2018, at 11:51 AM, Andrew Schwartzmeyer wrote:
> 
> Hi all,
> 
> There are two parallel efforts underway that would both benefit from 
> officially deprecating (and then removing) the Python bindings. The first 
> effort is the move to the CMake system: adding support to generate the Python 
> bindings was investigated but paused (see MESOS-8118), and the second effort 
> is the move to Python 3: producing Python 3 compatible bindings is under 
> investigation but not in progress (see MESOS-7163).
> 
> Benjamin Bannier, Joseph Wu, and I have all at some point just wondered how 
> the community would fare if the Python bindings were officially deprecated 
> and removed. So please, if this would negatively impact you or your project, 
> let me know in this thread.

Another approach could be to move the bindings from the `mesos` git repo to a 
separate repo (either at the ASF or in the `mesos` GitHub org). This could 
decouple it from the main Mesos build infrastructure and create a project for a 
Python community to coalesce around. I think there's value in nominating an 
official Python binding, but maybe we don't have to carry that in the same git 
repo and build system.

J

Re: Volume ownership and permission

2018-04-26 Thread James Peach


> On Apr 26, 2018, at 7:25 PM, Qian Zhang <zhq527...@gmail.com> wrote:
> 
> Hi James,
> 
> Thanks for your comment!
> 
> I think you are talking about the SANDBOX_PATH volume ownership issue
> mentioned in the design doc
> <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p>,
> IIUC, you prefer to leaving it to framework, i.e., framework itself ought
> to be able to handle such issue. But I am curious how framework can handle
> it in such situation. If the framework launches a task group with different
> users and with a SANDBOX_PATH volume of PARENT type, the tasks in the group
> will definitely fail to write to the volume due to the ownership issue
> though the volume's mode is set to "rw". So in this case, how should
> framework handle it?

The framework launched tasks in a group with different users? Sounds like they 
dug their own hole :)

I'd argue that the "rw" on the sandbox path is analogous to the "rw" mount 
option. That is, it is mounted writeable, but says nothing about which 
credentials can write to it.

> And if we want to document it, what is our recommended
> solution in the doc?
> 
> 
> 
> Regards,
> Qian Zhang
> 
> On Fri, Apr 27, 2018 at 1:16 AM, James Peach <jpe...@apache.org> wrote:
> 
>> I commented on the doc, but at least some of the issues raised there I
>> would not regard as issues. Rather, they are about setting expectations
>> correctly and ensuring that we are documenting (and maybe enforcing)
>> sensible behavior.
>> 
>> I'm not that keen on Mesos automatically "fixing" filesystem permissions
>> and we should proceed down that path with caution, especially in the ACLs
>> case.
>> 
>>> On Apr 10, 2018, at 3:15 AM, Qian Zhang <zhq527...@gmail.com> wrote:
>>> 
>>> Hi Folks,
>>> 
>>> I am working on MESOS-8767 to improve Mesos volume support regarding
>>> volume ownership and permission, here is the design doc. Please feel free
>>> to let me know if you have any comments/feedbacks, you can reply this mail
>>> or comment on the design doc directly. Thanks!
>>> 
>>> 
>>> Regards,
>>> Qian Zhang
>> 
>> 



Re: Volume ownership and permission

2018-04-26 Thread James Peach
I commented on the doc, but at least some of the issues raised there I would 
not regard as issues. Rather, they are about setting expectations correctly and 
ensuring that we are documenting (and maybe enforcing) sensible behavior. 

I'm not that keen on Mesos automatically "fixing" filesystem permissions and we 
should proceed down that path with caution, especially in the ACLs case.

> On Apr 10, 2018, at 3:15 AM, Qian Zhang  wrote:
> 
> Hi Folks,
> 
> I am working on MESOS-8767 to improve Mesos volume support regarding volume 
> ownership and permission, here is the design doc. Please feel free to let me 
> know if you have any comments/feedbacks, you can reply this mail or comment 
> on the design doc directly. Thanks!
> 
> 
> Regards,
> Qian Zhang



Re: Update the *Minimum Linux Kernel version* supported on Mesos

2018-04-05 Thread James Peach


> On Apr 5, 2018, at 5:00 AM, Andrei Budnik  wrote:
> 
> Hi All,
> 
> We would like to update minimum supported Linux kernel from 2.6.23 to
> 2.6.28.
> Linux kernel supports cgroups v1 starting from 2.6.24, but `freezer` cgroup
> functionality was merged into 2.6.28, which supports nested containers.

User namespaces require >= 3.12 (November 2013). Can we make that the minimum?
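
A rough sketch of how such a minimum could be checked, assuming the release
string has the usual dotted numeric prefix (as printed by `uname -r`). The
function name is hypothetical and this is not how Mesos performs the check.

```python
import re


def kernel_at_least(release, minimum):
    """Compare a kernel release string against a version tuple.

    `release` looks like "4.15.0-20-generic"; `minimum` is a tuple such
    as (3, 12) or (2, 6, 28). Only the leading dotted numbers are used.
    """
    numbers = re.match(r"\d+(?:\.\d+)*", release)
    if numbers is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    version = tuple(int(part) for part in numbers.group(0).split("."))
    return version >= minimum


print(kernel_at_least("2.6.32-754.el6.x86_64", (2, 6, 28)))
print(kernel_at_least("2.6.32-754.el6.x86_64", (3, 12)))
```

On a live system the release string could be taken from `platform.release()`.
Distribution kernels often backport features, so a version comparison like
this is only a first approximation.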

J

Re: Support deadline for tasks

2018-03-23 Thread James Peach


> On Mar 23, 2018, at 9:57 AM, Renan DelValle  wrote:
> 
> Hi Zhitao,
> 
> Since this is something that could potentially be handled by the executor 
> and/or framework, I was wondering if you could speak to the advantages of 
> making this a TaskInfo primitive vs having the executor (or even the 
> framework) handle it.

There's some discussion around this on 
https://issues.apache.org/jira/browse/MESOS-8725.

My take is that delegating too much to the scheduler makes schedulers harder to 
write and exacerbates the complexity of the system. If 4 different schedulers 
implement this feature, operators are likely to need to understand 4 different 
ways of doing the same thing, which would be unfortunate. 

J

Re: Support deadline for tasks

2018-03-22 Thread James Peach


> On Mar 22, 2018, at 10:06 AM, Zhitao Li  wrote:
> 
> In our environment, we run a lot of batch jobs, some of which have a tight 
> timeline. If any task in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point in keeping any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an optional TimeInfo deadline field to 
> TaskInfo, so that all default executors in Mesos can simply terminate the 
> task and send a proper StatusUpdate.
> 
> I summarized above idea in MESOS-8725.
> 
> Please let me know what you think. Thanks! 

This sounds both useful and simple to implement. I’m happy to shepherd if you’d 
like.

J

Re: Build Failure

2018-03-19 Thread James Peach


> On Mar 19, 2018, at 4:38 PM, Shiv Deepak  wrote:
> 
> Thanks. I installed unzip. That worked.

FWIW the test suite was fixed for 1.6 in 
0da7b6cc37786df94465ae98948fd7be669a843e.

> 
> On Mon, Mar 19, 2018 at 3:48 PM, Tomek Janiszewski  wrote:
> Do you have unzip installed? Can you try unzipping a file the way it's done in 
> the test? 
> 
> 
> On Mon, Mar 19, 2018 at 22:53, Shiv Deepak wrote:
> Hello,
> 
> I am trying to build Mesos 1.5.0 from source on Ubuntu 16.04.
> 
> I tried on Docker, VM, and EC2. Three test cases are failing no matter what.
> 
> Here is the list.
> 
> [  PASSED  ] 1904 tests.
> [  FAILED  ] 3 tests, listed below:
> [  FAILED  ] FetcherTest.Unzip_ExtractFile
> [  FAILED  ] FetcherTest.Unzip_ExtractInvalidFile
> [  FAILED  ] FetcherTest.Unzip_ExtractFileWithDuplicatedEntries
> 
> Here is the test output:
> 
> [ RUN  ] FetcherTest.Unzip_ExtractFile
> ../../src/tests/fetcher_tests.cpp:870: Failure
> (fetch).failure(): Failed to fetch all URIs for container 
> '709de28f-5f71-439d-a032-072df865090f': exited with status 1
> [  FAILED  ] FetcherTest.Unzip_ExtractFile (297 ms)
> [ RUN  ] FetcherTest.Unzip_ExtractInvalidFile
> ../../src/tests/fetcher_tests.cpp:936: Failure
> Value of: os::exists(extractedFile)
>   Actual: false
> Expected: true
> [  FAILED  ] FetcherTest.Unzip_ExtractInvalidFile (201 ms)
> [ RUN  ] FetcherTest.Unzip_ExtractFileWithDuplicatedEntries
> ../../src/tests/fetcher_tests.cpp:997: Failure
> (fetch).failure(): Failed to fetch all URIs for container 
> 'dd749015-3d16-4926-b7f3-e1c96211a461': exited with status 1
> [  FAILED  ] FetcherTest.Unzip_ExtractFileWithDuplicatedEntries (201 ms)
> 
> Is this expected or do I need to fix something? Can someone please point me 
> in the right direction?
> 
> Thank you
> 
> -- 
> 
> Shiv Deepak
> Engineering Manager
> HackerRank
> 


Re: [VOTE] Release Apache Mesos 1.5.0 (rc2)

2018-02-07 Thread James Peach
+1 (binding)

Tested on Fedora 27

> On Feb 1, 2018, at 5:36 PM, Gilbert Song  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.5.0.
> 
> 1.5.0 includes the following:
> 
>  * Support Container Storage Interface (CSI).
>  * Agent reconfiguration policy.
>  * Auto GC docker images in Mesos Containerizer.
>  * Standalone containers.
>  * Support gRPC client.
>  * Non-leading VOTING replica catch-up.
> 
> 
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.0-rc2
> 
> 
> The candidate for Mesos 1.5.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc2/mesos-1.5.0.tar.gz
> 
> The tag to be voted on is 1.5.0-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.0-rc2
> 
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc2/mesos-1.5.0.tar.gz.md5
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc2/mesos-1.5.0.tar.gz.asc
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1222
> 
> Please vote on releasing this package as Apache Mesos 1.5.0!
> 
> The vote is open until Tue Feb  6 17:35:16 PST 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.5.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> Jie and Gilbert



Re: [VOTE] Release Apache Mesos 1.5.0 (rc1)

2018-01-24 Thread James Peach
+1

Verified on CentOS 6 and Fedora 27

> On Jan 22, 2018, at 7:15 PM, Gilbert Song  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.5.0.
> 
> 1.5.0 includes the following:
> 
>   * Support Container Storage Interface (CSI).
>   * Agent reconfiguration policy.
>   * Auto GC docker images in Mesos Containerizer.
>   * Standalone containers.
>   * Support gRPC client.
>   * Non-leading VOTING replica catch-up.
> 
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.0-rc1
> 
> 
> The candidate for Mesos 1.5.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz
> 
> The tag to be voted on is 1.5.0-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.0-rc1
> 
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz.md5
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz.asc
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1221
> 
> Please vote on releasing this package as Apache Mesos 1.5.0!
> 
> The vote is open until Thu Jan 25 18:24:36 PST 2018 and passes if a majority 
> of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.5.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> Jie and Gilbert



Re: Doc-a-thon - January 11th, 2018

2018-01-09 Thread James Peach
Just a reminder that the Docathon is this Thursday :)

> On Nov 21, 2017, at 4:14 PM, Judith Malnick  wrote:
> 
> Hi all,
> 
> I'm excited to announce the next Apache Mesos doc-a-thon!
> 
> *Date:* January 11th, 2018
> 
> Location:
> 
> Mesosphere HQ
> 
> 88 Stevenson Street
> 
> San Francisco, CA
> 
> Schedule (Pacific time):
> 
> 3 - 3:30 PM: Discuss docs projects, split into groups
> 
> 3:30 - 6:30 PM: Work on docs
> 
> 6:30 - 7 PM: Present progress
> 
> 7 - 8 PM: Drinks and hangout!
> 
> 
> If you will be attending in person, please RSVP so we know how much food
> to get.
> If you plan on attending remotely, you can join with this Zoom link.
> Feel free to brainstorm project proposals on this planning doc.
> 
> 
> Let me know if you have any questions. I'm looking forward to seeing all of
> you and your amazing projects!
> 
> All the Best,
> Judith
> -- 
> Judith Malnick
> Community Manager
> 310-709-1517



Re: Container user '27' is not supported

2017-12-25 Thread James Peach


> On Dec 25, 2017, at 2:22 PM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
> 
> 
> Should this be done via the parameters? What key?
> 
> "parameters": [{ "key": "net", "value": "host" }]
> 
> 
> {
>  "id": "sflow/vizceral",
>  "cmd": null,
>  "cpus": 0.2,
>  "mem": 256,
>  "instances": 1,
>  "acceptedResourceRoles": ["*"],
>  "constraints": [["hostname", "CLUSTER", "m02.local"]],
>  "container": {
>"type": "MESOS",
>"docker": {
>  "image": "sflow/vizceral",
>  "credential": null,
>  "forcePullImage": false
>}
> 
>  }
> }

I guess this is a Marathon task spec? I’m not familiar with the Marathon API, 
but it looks to me like you would specify the “user” field in the application 
definition:

https://docs.mesosphere.com/1.9/deploying-services/marathon-api/#/apps/V2Apps3

> 
> 
> Dec 25 23:15:40 m02 mesos-slave[18569]: W1225 23:15:40.251715 18595 
> runtime.cpp:111] Container user 'sflowrt' is not supported yet for 
> container 375b21ca-2d12-4a81-8429-897aac75eaa0
> 
> -Original Message-
> From: James Peach [mailto:jor...@gmail.com] 
> Sent: zondag 24 december 2017 18:01
> To: user
> Subject: Re: Container user '27' is not supported
> 
> 
> 
>> On Dec 24, 2017, at 5:20 AM, Marc Roos <m.r...@f1-outsourcing.eu> 
> wrote:
>> 
>> 
>> I am seeing this in the logs:
>> 
>> Container user '27' is not supported yet for container
>> d823196a-4ec3-41e3-a4c0-6680ba5cc99
>> 
>> I guess this means that the container requests to run under a specific 
> 
>> user id, and this is not yet available in mesos?
> 
> This means that the containerizer parsed the container user out of the 
> manifest, but we don’t support running the container as that user. You 
> should continue to use the TaskInfo message to specify which user the 
> container will run as.
> 
> J
> 



Re: Container user '27' is not supported

2017-12-24 Thread James Peach


> On Dec 24, 2017, at 5:20 AM, Marc Roos  wrote:
> 
> 
> I am seeing this in the logs:
> 
> Container user '27' is not supported yet for container 
> d823196a-4ec3-41e3-a4c0-6680ba5cc99
> 
> I guess this means that the container requests to run under a specific 
> user id, and this is not yet available in mesos?

This means that the containerizer parsed the container user out of the 
manifest, but we don’t support running the container as that user. You should 
continue to use the TaskInfo message to specify which user the container will 
run as.

J

narrowing task sandbox permissions

2017-12-14 Thread James Peach
Hi all,

In https://issues.apache.org/jira/browse/MESOS-8332, I'm proposing a change to 
narrow the permissions used for the task sandbox directory from 0755 to 0750. 
Note that this change also makes failure to chown this directory into a hard 
failure.
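
To make the effect of the narrower mode concrete, here is a minimal Python sketch. It is not the Mesos implementation: the temporary directory only stands in for a task sandbox, and `prepare_sandbox` is a hypothetical helper. It applies 0750 and lets a chown failure propagate as an error, which is the behavior proposed above.

```python
import os
import stat
import tempfile


def prepare_sandbox(path, uid, gid):
    """Create a sandbox-like directory with mode 0750 owned by uid:gid.

    Any failure, including chown, raises OSError (a hard failure).
    """
    os.makedirs(path, exist_ok=True)
    os.chown(path, uid, gid)  # raises OSError if ownership cannot be set
    os.chmod(path, 0o750)     # explicit chmod, so the umask does not matter
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as root:
    sandbox = os.path.join(root, "sandbox")
    mode = prepare_sandbox(sandbox, os.getuid(), os.getgid())
    # With 0750, "other" users have no read, write or execute bits.
    others_bits = mode & stat.S_IRWXO
```

With 0755 the `others_bits` value would be non-zero (read and execute); with 0750 only the owner and the group can enter the directory.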

I expect this is a safe change for well-behaved configurations, but please let 
me know if you have any compatibility concerns.

thanks,
James

Re: Adding a new agent terminates existing executors?

2017-11-15 Thread James Peach

> On Nov 15, 2017, at 8:24 AM, Dan Leary  wrote:
> 
> Yes, as I said at the outset, the agents are on the same host, with different 
> ip's and hostname's and work_dir's.
> If having separate work_dirs is not sufficient to keep containers separated 
> by agent, what additionally is required?

You might also need to specify other separate agent directories, like 
--runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of 
mesos-agent --flags.
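
As a rough illustration, the following Python sketch builds a distinct set of directory flags per agent. Only the three flag names come from this thread; the base path and the sub-directory names are made up, and the list is not claimed to be complete, so check the output of mesos-agent --flags for others.

```python
import os


def agent_flags(name, base="/var/lib/mesos"):
    """Return per-agent directory flags rooted under base/name.

    The base path and sub-directory names are hypothetical; only the flag
    names themselves are taken from the discussion above.
    """
    root = os.path.join(base, name)
    return [
        "--work_dir=" + os.path.join(root, "work"),
        "--runtime_dir=" + os.path.join(root, "run"),
        "--docker_volume_checkpoint_dir=" + os.path.join(root, "docker-volumes"),
    ]


agent1 = agent_flags("agent1")
agent2 = agent_flags("agent2")

# No flag value should be shared between two agents on the same host.
shared = set(agent1) & set(agent2)
```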

> 
> 
> On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone  wrote:
> How is agent2 able to see agent1's containers? Are they running on the same 
> box!? Are they somehow sharing the filesystem? If yes, that's not supported.
> 
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary  wrote:
> Sure, master log and agent logs are attached.
> 
> Synopsis:  In the master log, tasks t01 and t02 are running...
> 
> > I1114 17:08:15.972033  5443 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t01 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:19.142276  5448 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t02 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> 
> Operator starts up agent2 around 17:08:50ish.  Executor1 and its tasks are 
> terminated
> 
> > I1114 17:08:54.835841  5447 master.cpp:6964] Executor 'executor1' of 
> > framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1): terminated with signal Killed
> > I1114 17:08:54.835959  5447 master.cpp:9051] Removing executor 'executor1' 
> > with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on 
> > agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1)
> > I1114 17:08:54.837419  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837497  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896  5436 master.cpp:8928] Updating the state of task 
> > t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.839221  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493  5436 master.cpp:8928] Updating the state of task 
> > t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> 
> But agent2 doesn't register until later...
> 
> > I1114 17:08:55.588762  5442 master.cpp:5714] Received register agent 
> > message from slave(1)@127.1.1.2:5052 (agent2)
> 
> Meanwhile in the agent1 log, the termination of executor1 appears to be the 
> result of the destruction of its container...
> 
> > I1114 17:08:54.810638  5468 containerizer.cpp:2612] Container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732  5468 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761  5468 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> 
> Apparently because agent2 decided to "recover" the very same container...
> 
> > I1114 17:08:54.775907  6041 linux_launcher.cpp:373] 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634  6037 containerizer.cpp:966] Cleaning up orphan 
> > container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705  6037 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737  6037 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> > I1114 17:08:54.780740  6041 linux_launcher.cpp:505] Asked to destroy 
> > 

Re: 1.4.1 release

2017-11-03 Thread James Peach
I think MESOS-8169 is a candidate, but I won't be able to get to it until next 
week.


> On Nov 3, 2017, at 1:48 AM, Qian Zhang  wrote:
> 
> And I will backport MESOS-8051 to 1.2.x, 1.3.x and 1.4.x.
> 
> 
> Regards,
> Qian Zhang
> 
> On Fri, Nov 3, 2017 at 9:01 AM, Qian Zhang  wrote:
> We want to backport https://reviews.apache.org/r/62518/ to 1.2.x, 1.3.x and 
> 1.4.x, James will work on it.
> 
> 
> Regards,
> Qian Zhang
> 
> On Fri, Nov 3, 2017 at 12:11 AM, Kapil Arya  wrote:
> Please reply to this email if you have pending patches to be backported to 
> 1.4.x as we are aiming to cut a release candidate for 1.4.1 early next week.
> 
> Thanks,
> Anand and Kapil
> 
> 



Re: clearing the executor authentication token from the task environment

2017-11-02 Thread James Peach

> On Nov 1, 2017, at 2:28 PM, James Peach <jor...@gmail.com> wrote:
> 
> Hi all,
> 
> In https://issues.apache.org/jira/browse/MESOS-8140, I'm proposing that we 
> clear the MESOS_EXECUTOR_AUTHENTICATION_TOKEN environment variable 
> immediately after consuming it in the built-in executors. This protects it 
> from observation by other tasks in the same PID namespace, however I wanted 
> to verify that no-one currently has a use case that depends on this. 
> Currently, the token is inherited to the environment of tasks running under 
> the command executor (i.e. not to task group tasks).
> 
> Eventually we would add a formal API for tasks to access the executor token 
> in MESOS-8018.

Ok, we will be landing this change for Mesos 1.5

thanks,
James

clearing the executor authentication token from the task environment

2017-11-01 Thread James Peach
Hi all,

In https://issues.apache.org/jira/browse/MESOS-8140, I'm proposing that we 
clear the MESOS_EXECUTOR_AUTHENTICATION_TOKEN environment variable immediately 
after consuming it in the built-in executors. This protects it from observation 
by other tasks in the same PID namespace, however I wanted to verify that 
no-one currently has a use case that depends on this. Currently, the token is 
inherited to the environment of tasks running under the command executor (i.e. 
not to task group tasks).
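
For illustration only, here is a small Python sketch of the "consume, then clear" pattern. It is not the executor code; the token value is a stand-in and `consume_token` is a hypothetical helper. It only demonstrates the inheritance side: a child process started after the variable is cleared does not see it.

```python
import os
import subprocess
import sys

TOKEN_VAR = "MESOS_EXECUTOR_AUTHENTICATION_TOKEN"


def consume_token(environ=os.environ):
    """Return the token (or None) and remove it from the environment."""
    return environ.pop(TOKEN_VAR, None)


# Stand-in for the value that would already be set for the executor.
os.environ[TOKEN_VAR] = "example-token"
token = consume_token()

# A child started afterwards inherits os.environ, which no longer
# contains the token.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('MESOS_EXECUTOR_AUTHENTICATION_TOKEN'))"],
    capture_output=True, text=True, check=True)
child_view = child.stdout.strip()
```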

Eventually we would add a formal API for tasks to access the executor token in 
MESOS-8018.

thanks,
James

Re: Adding the limited resource to TaskStatus messages

2017-10-10 Thread James Peach

> On Oct 9, 2017, at 7:15 PM, Wil Yegelwel <wyegel...@gmail.com> wrote:
> 
> Is it correct to say that the limited resource field is *only* meant to 
> provide machine readable information about what resources limits were 
> exceeded?

Yes.

> If so, does it make sense to provide richer reporting fields for all failure 
> reasons? I imagine other failure reasons could benefit from being able to 
> report details of the failure that are machine readable.

Some other reasons already have their own structured information, eg. the 
TASK_UNREACHABLE state populates the `unreachable_time` field. I'm not planning 
to add structured information to any other failure reasons, but I'd support 
doing it if you have a specific suggestion.

> On Mon, Oct 9, 2017, 3:50 PM James Peach <jor...@gmail.com> wrote:
> 
> > On Oct 9, 2017, at 1:27 PM, Vinod Kone <vinodk...@apache.org> wrote:
> >
> >> In the case that a task is killed because it violated a resource
> >> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
> >> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
> >> this field may be populated with the resource that triggered the
> >> limitation. This is intended to give better information to schedulers about
> >> task resource failures, in the expectation that it will help them bubble
> >> useful information up to the user or a monitoring system.
> >>
> >
> > Can you elaborate what schedulers are expected to do with this information?
> > Looking for some concrete use cases if you can.
> 
> There's no concrete use case here; it's just a matter of propagating 
> information we know in a structured way.
> 
> If we assume that the scheduler knows about some sort of monitoring system or 
> has a UI, we can present this to the user or a system that can take action on 
> it. The status quo is that the raw message string is dumped to logs, and has 
> to be manually interpreted.
> 
> Additionally, this can pave the way to getting rid of 
> REASON_CONTAINER_LIMITATION_DISK and REASON_CONTAINER_LIMITATION_MEMORY. All 
> you really need is REASON_CONTAINER_LIMITATION plus the resource information.
> 
> J
> 



Re: Adding the limited resource to TaskStatus messages

2017-10-09 Thread James Peach

> On Oct 9, 2017, at 1:27 PM, Vinod Kone  wrote:
> 
>> In the case that a task is killed because it violated a resource
>> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
>> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
>> this field may be populated with the resource that triggered the
>> limitation. This is intended to give better information to schedulers about
>> task resource failures, in the expectation that it will help them bubble
>> useful information up to the user or a monitoring system.
>> 
> 
> Can you elaborate what schedulers are expected to do with this information?
> Looking for some concrete use cases if you can.

There's no concrete use case here; it's just a matter of propagating 
information we know in a structured way.

If we assume that the scheduler knows about some sort of monitoring system or 
has a UI, we can present this to the user or a system that can take action on 
it. The status quo is that the raw message string is dumped to logs, and has to 
be manually interpreted. 

Additionally, this can pave the way to getting rid of 
REASON_CONTAINER_LIMITATION_DISK and REASON_CONTAINER_LIMITATION_MEMORY. All 
you really need is REASON_CONTAINER_LIMITATION plus the resource information.
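
As a sketch of how a scheduler might consume this, assume a status update is available as a plain dict that mirrors the JSON form of TaskStatus, with the proposed `limited_resources` field. The dict shape and the helper name are assumptions for illustration, not an existing API.

```python
LIMITATION_REASONS = {
    "REASON_CONTAINER_LIMITATION",
    "REASON_CONTAINER_LIMITATION_DISK",
    "REASON_CONTAINER_LIMITATION_MEMORY",
}


def limited_resource_names(status):
    """Return the distinct resource names that triggered a limitation.

    Returns an empty list when the reason is not a container limitation
    or when the (optional) field was not populated.
    """
    if status.get("reason") not in LIMITATION_REASONS:
        return []
    names = []
    for resource in status.get("limited_resources", []):
        name = resource.get("name")
        if name and name not in names:
            names.append(name)
    return names


# Hypothetical update: the same resource appears once per role.
status = {
    "state": "TASK_FAILED",
    "reason": "REASON_CONTAINER_LIMITATION",
    "limited_resources": [
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 1}, "role": "*"},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 2}, "role": "role"},
    ],
}
names = limited_resource_names(status)
```

A UI or monitoring hook could then report "mem" directly instead of parsing the message string.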

J



Adding the limited resource to TaskStatus messages

2017-10-09 Thread James Peach
Hi all,

In https://reviews.apache.org/r/62644/, I am proposing to add an optional 
Resources field to the TaskStatus message named `limited_resources`.

In the case that a task is killed because it violated a resource constraint 
(ie. the reason field is REASON_CONTAINER_LIMITATION, 
REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY), this 
field may be populated with the resource that triggered the limitation. This is 
intended to give better information to schedulers about task resource failures, 
in the expectation that it will help them bubble useful information up to the 
user or a monitoring system.

diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
index d742adbbf..559d09e37 100644
--- a/include/mesos/v1/mesos.proto
+++ b/include/mesos/v1/mesos.proto
@@ -2252,6 +2252,13 @@ message TaskStatus {
   // status updates for tasks running on agents that are unreachable
   // (e.g., partitioned away from the master).
   optional TimeInfo unreachable_time = 14;
+
+  // If the reason field indicates a container resource limitation,
+  // this field contains the resource whose limits were violated.
+  //
+  // NOTE: 'Resources' is used here because the resource may span
+  // multiple roles (e.g. `"mem(*):1;mem(role):2"`).
+  repeated Resource limited_resources = 16;
 }



cheers,
James




Re: RFC: Partition Awareness

2017-10-05 Thread James Peach

> On Jun 21, 2017, at 10:16 AM, Megha Sharma  wrote:
> 
> Thank you all for the feedback.
> To summarize, not killing tasks for non-Partition Aware frameworks will make 
> the schedulers see a higher volume of non terminal updates for tasks for 
> which they have already received a TASK_LOST but nothing new that they are 
> not seeing today. So, this shouldn’t be a breaking change for frameworks and 
> this will make the partition awareness logic simpler. I will update 
> MESOS-7215 with the details once the design is ready.

What happens for short-lived frameworks? That is, the lost task comes back, 
causing the master to track its framework as disconnected, but the framework is 
gone and will never return.

J

Re: Are there any supported systems without O_CLOEXEC?

2017-09-29 Thread James Peach

> On Sep 29, 2017, at 11:34 AM, Benjamin Mahler <bmah...@apache.org> wrote:
> 
> Is this altering the minimum Linux or OS X version we support?


I couldn't find a clear statement of what OS support we guarantee. OS X got 
O_CLOEXEC in 10.10. CentOS 6.9 has kernel 2.6.32, apparently Ubuntu 14.04 has 
3.19. Do we support anything older than that?

> 
> On Fri, Sep 29, 2017 at 9:15 AM, James Peach <jor...@gmail.com> wrote:
> 
>> 
>>> On Sep 27, 2017, at 5:03 PM, James Peach <jor...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> In MESOS-8027 and https://reviews.apache.org/r/62638/, I'm claiming
>>> that, in practice, we do not have any supported platforms that don't
>>> implement O_CLOEXEC to open. All current Linux, FreeBSD and Solaris
>>> versions implement O_CLOEXEC. Does anyone know of a platform that doesn't
>>> have O_CLOEXEC that we ought to work on?
>>> 
>>> https://www.freebsd.org/cgi/man.cgi?sektion=2=open
>>> http://man7.org/linux/man-pages/man2/open.2.html
>>> https://docs.oracle.com/cd/E23824_01/html/821-1463/open-2.html
>>> https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man2/open.2.html
>> 
>> Bump! If you run Mesos on a platform that doesn't support O_CLOEXEC (eg.
>> Linux kernel <= 2.6.23), please let us know!
>> 
>> J



Re: Are there any supported systems without O_CLOEXEC?

2017-09-29 Thread James Peach

> On Sep 27, 2017, at 5:03 PM, James Peach <jor...@gmail.com> wrote:
> 
> Hi all,
> 
> In MESOS-8027 and https://reviews.apache.org/r/62638/, I'm claiming that, in 
> practice, we do not have any supported platforms that don't implement 
> O_CLOEXEC to open. All current Linux, FreeBSD and Solaris versions implement 
> O_CLOEXEC. Does anyone know of a platform that doesn't have O_CLOEXEC that we 
> ought to work on?
> 
> https://www.freebsd.org/cgi/man.cgi?sektion=2=open
> http://man7.org/linux/man-pages/man2/open.2.html
> https://docs.oracle.com/cd/E23824_01/html/821-1463/open-2.html
> https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man2/open.2.html

Bump! If you run Mesos on a platform that doesn't support O_CLOEXEC (eg. Linux 
kernel <= 2.6.23), please let us know!
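
If you want a quick check on a given host, the following Python sketch opens a file with O_CLOEXEC and reads the descriptor flags back. It is only an illustration, and note that Python 3.4 and later marks new descriptors close-on-exec by default anyway, so the check mainly confirms that the O_CLOEXEC constant exists and that open accepts it on the platform.

```python
import fcntl
import os
import tempfile


def open_cloexec(path):
    """Open path read-only, requesting close-on-exec at open() time."""
    return os.open(path, os.O_RDONLY | os.O_CLOEXEC)


def has_cloexec(fd):
    """Report whether FD_CLOEXEC is set on an open descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


with tempfile.NamedTemporaryFile() as tmp:
    fd = open_cloexec(tmp.name)
    try:
        cloexec_set = has_cloexec(fd)
    finally:
        os.close(fd)
```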

J

Re: Collect feedbacks on TASK_FINISHED

2017-09-22 Thread James Peach

> On Sep 21, 2017, at 10:12 PM, Vinod Kone  wrote:
> 
> I think it makes sense for `TASK_KILLED` to be sent in response to a KILL
> call irrespective of the exit status. IIRC, that was the original intention.

Those are the semantics we implement and expect in our scheduler and executor. 
The only time we emit TASK_KILLED is in response to a scheduler kill, and a 
scheduler kill always ends in a TASK_KILLED.

The rationale for this is:
1. We want to distinguish whether the task finished for its own reasons (ie. 
not due to a scheduler kill)
2. The scheduler told us to kill the task and we did, so it was TASK_KILLED 
(irrespective of any exit status)
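
Expressed as a small Python sketch, these semantics look roughly like this. It is an illustration of the rule described above, not the code of our executor, and mapping a non-zero exit to TASK_FAILED is an assumption added to make the function total.

```python
def terminal_state(kill_requested, exit_status):
    """Pick the terminal task state.

    A scheduler-initiated kill always ends in TASK_KILLED, whatever the
    exit status. Otherwise the exit status decides (assumed mapping:
    0 is TASK_FINISHED, anything else is TASK_FAILED).
    """
    if kill_requested:
        return "TASK_KILLED"
    return "TASK_FINISHED" if exit_status == 0 else "TASK_FAILED"


# The task handled SIGTERM gracefully and exited 0 after a scheduler kill.
graceful_kill = terminal_state(kill_requested=True, exit_status=0)
```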

> On Thu, Sep 21, 2017 at 8:20 PM, Qian Zhang  wrote:
> 
>> Hi Folks,
>> 
>> I'd like to collect feedback on the task state TASK_FINISHED.
>> Currently the default and command executor will always send TASK_FINISHED
>> as long as the exit code of the task is 0. This causes an issue: when the scheduler
>> initiates a kill task, the executor will send SIGTERM to the task first,
>> and if the task handles SIGTERM gracefully and exit with 0, the executor
>> will send TASK_FINISHED for that task, so we will see the task state
>> transition: TASK_KILLING -> TASK_FINISHED.
>> 
>> This seems incorrect because we thought it should be TASK_KILLING ->
>> TASK_KILLED, that's why we filed a ticket MESOS-7975
>>  for it. However, I am
>> not very sure if it is really a bug, because I think it depends on how we
>> define the meaning of TASK_FINISHED, if it means the task is terminated
>> successfully on its own without external interference, then I think it does
>> not make sense for scheduler to receive a TASK_KILLING followed by a
>> TASK_FINISHED since there is indeed an external interference (killing task
>> is initiated by scheduler). However, if TASK_FINISHED means the task is
>> terminated successfully for whatever reason (no matter it is killed or
>> terminated on its own), then I think it is OK to receive a TASK_KILLING
>> followed by a TASK_FINISHED.
>> 
>> Please let us know your thoughts on this issue, thanks!
>> 
>> 
>> Regards,
>> Qian Zhang
>> 



Re: TASK_FAILED - Mesos Container Images

2017-09-06 Thread James Peach

> On Sep 6, 2017, at 4:41 AM, Thodoris Zois  wrote:
> 
> Hello, 
> 
> I am using the Mesos Containerizer with Docker Images. The problem is that 
> whenever a container exits my task gets TASK_FAILED because the container 
> exits with ‘1’.
> My docker file invokes a shell script via CMD /script.sh.
> 
> My protobuf can be found below:
> https://pastebin.com/1agjAFdm
> 
> 
> The agent log can be found here:
> https://pastebin.com/Q6qndVuU
> 
> 
> The stderr and stdout from the UI can be found here:
> stdout: https://pastebin.com/SkDDUDDJ
> 
> stderr: https://pastebin.com/t2TFgQ4G

/execute_blast.sh: line 4: /proc/sys/vm/drop_caches: Read-only file system


Does your task exit when this happens?




Re: Deprecating `--disable-zlib` in libprocess

2017-08-08 Thread James Peach

> On Aug 8, 2017, at 10:57 AM, Chun-Hung Hsiao  wrote:
> 
> Hi all,
> 
> In libprocess, we have an optional `--disable-zlib` flag, but it's
> currently not used
> for conditional compilation and we always use zlib in libprocess,
> and there's a requirement check in Mesos to make sure that zlib exists.
> Should this option be removed then?

Yes.

> Or is there anyone working on a system without zlib?
> 
> Thanks for your opinions!
> Chun-Hung



Re: Command Executor

2017-08-07 Thread James Peach

> On Aug 5, 2017, at 3:03 AM, Oeg Bizz  wrote:
> 
> I have a framework that relies on information sent by a custom Java Command 
> Executor; think of some sort of heartbeat.  I start getting heartbeats after I 
> send a task to that mesos-slave, but never before that.  That makes me assume 
> that the CommandExecutor is not started until a task is submitted to be 
> executed by that agent.  Is there a way to tell mesos-slave to start the 
> CommandExecutor as soon as it starts running?

Not AFAIK. Executors are always spawned in order to execute tasks. In your 
case, what is the heartbeat for, if there are no tasks on the agent?

J

Re: Mesos-docker-executor understanding

2017-07-21 Thread James Peach

> On Jul 19, 2017, at 10:05 AM, Thomas HUMMEL  wrote:
> 
> Hello,
> 
> I've read some books about Mesos, installed one multi-master cluster (for POC 
> purposes) with some frameworks (Marathon, Spark for instance) and watch some 
> talks.
> 
> Everything works and my understanding of Mesos is becoming clearer.
> However, I'm having a hard time fully understanding this section of the 
> documentation :
> 
>  http://mesos.apache.org/documentation/latest/containerizer-internals/
> 
> Note :
> 
> - my understanding is that Mesos is now heading towards a universal 
> containerizer, which is able to understand docker (among others) images thus 
> being able to "do some docker" without a docker daemon.


> 
> - Also, I don't think Mesos supports nested containers yet either.

http://mesos.apache.org/documentation/latest/nested-container-and-task-group/

> 
> But I'm not really sure about what is the mesos-docker-executor, as opposed 
> to mesos-executor.
> 
> - For instance, in the "A)" case of the slave running in a container, the doc 
> states :
> 
> "if the task does not include an executor i.e. it defines a command, the 
> default executor mesos-docker-executor is launched in a docker container to 
> execute the command via Docker CLI."
> 
> Why Docker CLI ? We are not in a shell context, are we ?
> Also, will the command be launched in another docker container, different 
> from the one running mesos-docker-executor ?
> 
> - In the B) case where the slave is not running in a container, the doc 
> states :
> 
> "If task does not include an executor i.e. it defines a command, a subprocess 
> is forked to execute the default executor mesos-docker-executor. 
> mesos-docker-executor then spawns a shell to execute the command via Docker 
> CLI."
> 
> Why is the mesos-docker-executor not run in a container ? Also, why does it 
> not use docker API directly ?
> 
> Can you help me figure out how exactly mesos-docker-executor works and how 
> it differs from mesos-executor?

From the scheduler's POV, I'd say you don't really need to know the difference. 
In the API you need to specify the container type to switch between using the 
Docker containerizer and the Mesos containerizer. For example, you can 
experiment by passing JSON to the --task option of mesos-execute:

{
  "name": "sleep",
  "agent_id": { "value": "any" },
  "task_id": {
    "value": "some-unique-uuid"
  },
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": {
        "value": 0.4
      },
      "role": "*"
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": {
        "value": 32
      },
      "role": "*"
    }
  ],
  "command": {
    "value": "command-line that you want to run",
    "environment": {
      "variables": [
        { "name": "GLOG_v", "value": "2" }
      ]
    }
  },
  "container": {
    "type": "MESOS",
    "mesos": {
      "image": {
        "type": "DOCKER",
        "docker": {
          "name": "image/name"
        }
      }
    }
  }
}
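
If you want a fresh task ID on every run, a small Python sketch like the one below can generate the same structure. The `make_task` helper and the "sleep 60" command are made up for illustration; the field names simply follow the JSON above, and the result is what you would hand to the --task option.

```python
import json
import uuid


def make_task(command, image, cpus=0.4, mem=32):
    """Build a TaskInfo-shaped dict like the JSON above, with a new UUID."""

    def scalar(name, value):
        return {
            "name": name,
            "type": "SCALAR",
            "scalar": {"value": value},
            "role": "*",
        }

    return {
        "name": "sleep",
        "agent_id": {"value": "any"},
        "task_id": {"value": str(uuid.uuid4())},
        "resources": [scalar("cpus", cpus), scalar("mem", mem)],
        "command": {"value": command},
        "container": {
            "type": "MESOS",
            "mesos": {"image": {"type": "DOCKER", "docker": {"name": image}}},
        },
    }


task = make_task("sleep 60", "image/name")
task_json = json.dumps(task, indent=2)
```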



Re: Format for attributes with no value

2017-07-14 Thread James Peach

> On Jul 13, 2017, at 1:41 PM, Jeff Kubina <jeff.kub...@gmail.com> wrote:
> 
> I want to know the format for an empty attribute in the list format for the 
> mesos-slave --attributes option. If I have an attribute, say key2, with no 
> value would it be "mesos-slave --attributes key1:value1;key2;key3:value3" or 
> "mesos-slave --attributes key1:value1;key2:;key3:value3" or does it not matter?

As I said before, I don't think there is a way to have an empty attribute 
value. 

$ sudo /opt/mesos/agent "--attributes=key1:value1;key2:;key3:value3"
...
F0714 06:16:49.332021 16167 attributes.cpp:145] Invalid attribute key:value 
pair 'key2:'

$ sudo /opt/mesos/agent "--attributes=key1:value1;key2;key3:value3"
...
F0714 06:17:58.916723 16297 attributes.cpp:145] Invalid attribute key:value 
pair 'key2'

Please file a bug at https://issues.apache.org/jira/projects/MESOS
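
The behavior observed above can be modeled with a short Python sketch. This is not the parser in attributes.cpp: it ignores attribute types and keeps every value as a string. It is only meant to show that both the "key2" and the "key2:" forms are rejected.

```python
def parse_attributes(text):
    """Parse "key:value;key:value" pairs, rejecting empty keys or values."""
    attrs = {}
    for pair in text.split(";"):
        key, sep, value = pair.partition(":")
        if not sep or not key or not value:
            raise ValueError("Invalid attribute key:value pair '%s'" % pair)
        attrs[key] = value
    return attrs


def rejected(text):
    """Return True if the attribute string fails to parse."""
    try:
        parse_attributes(text)
    except ValueError:
        return True
    return False


ok = parse_attributes("key1:value1;key3:value3")
```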

> 
> -- 
> Jeff Kubina
> 410-988-4436
> 
> 
> On Tue, Jul 11, 2017 at 5:20 AM, Oeg Bizz <oegb...@yahoo.com> wrote:
> James,
>   If you need an empty attribute as default for mesos, just create an empty 
> file with the '?' in front of it and save it in the /etc/mesos-<master|slave> 
> directory.  For instance, if you want to enable authentication and 
> want to pass the --authenticate attribute then create an empty file called 
> /etc/mesos-master/?authenticate.
> 
> Not sure if that is what you meant with your question,
> 
> Oscar
> 
> 
> On Tuesday, July 11, 2017, 12:53:37 AM EDT, James Peach <jor...@gmail.com> 
> wrote:
> 
> 
> 
> > On Jul 7, 2017, at 4:46 PM, Jeff Kubina <jeff.kub...@gmail.com> wrote:
> > 
> > When setting an attribute with no value of a mesos-agent is the colon 
> > needed, optional, or must it be omitted? It's not clear from the 
> > documentation. For example, which line or lines below are correct?
> > 
> > att1:val1;att2;att3:val3
> > 
> > att1:val1;att2:;att3:val3
> 
> 
> I don't see a way to express an empty attribute at all :(
> 



Re: Format for attributes with no value

2017-07-10 Thread James Peach

> On Jul 7, 2017, at 4:46 PM, Jeff Kubina  wrote:
> 
> When setting an attribute with no value of a mesos-agent is the colon needed, 
> optional, or must it be omitted? It's not clear from the documentation. For 
> example, which line or lines below are correct?
> 
> att1:val1;att2;att3:val3
> 
> att1:val1;att2:;att3:val3

I don't see a way to express an empty attribute at all :(

Re: Dynamic reservations without a principal

2017-07-05 Thread James Peach

> On Jul 4, 2017, at 5:27 PM, Srikanth Viswanathan  wrote:
> 
> Hi folks,
> 
> I am trying to have the Chronos framework consume dynamic reservations in 
> Mesos. However, it appears that Chronos is unable to do this because it does 
> not pass the framework principal to Mesos when launching tasks (See 
> https://github.com/mesos/chronos/issues/843), which makes Mesos reject the 
> launch operation.
> 
> To get around this, I am considering changing my dynamic reservations to be 
> purely role-based instead of (role, principal)-based. Is this allowed/valid? 
> http://mesos.readthedocs.io/en/0.24.1/reservation/#dynamic-reservation-since-0230
>  says "resources are reserved for a role." Does this mean I can make a 
> dynamic reservation just for (role) instead of (role, principal)?

FYI the official documentation is at 
http://mesos.apache.org/documentation/latest/

Re: Framework change role

2017-07-05 Thread James Peach

> On Jul 5, 2017, at 2:39 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
> 
> Ok, you are probably right, but what you don’t understand is that I am a 
> complete newbie and I am seeing such systems for the first time. This is a 
> university project that I am working on in order to get my bachelor's degree. 
> I don’t really know the proper way to express what I want like you do. I 
> don’t know how to connect 2 frameworks, and I don’t even know how to make my 
> own framework and compile it with the dependencies that Mesos needs. I am 
> just writing in TestFramework.java and TestExecutor.java and that’s all. The 
> reason that I want to run everything from the same framework (instance) is 
> that I keep a TreeMap with some info that I don’t want to lose if I terminate 
> the Scheduler's driver. If I start up a new framework, the TreeMap is gone. 
> Forget about the tasks, I can use the same TestExecutor.java for every 
> scheduler.

Note that you can start multiple Scheduler drivers within the same server. Each 
one can register with Mesos as a separate framework.

> What I want to achieve is to get offers from a specific agent with role “SB” 
> and run 10 tasks (I also give my framework the role “SB”), then store that 
> information (which is actually TaskInfo) in the Scheduler's TreeMap. After 
> that I would like to change the role of my framework to “*” (the default). 
> Then I will get offers from an agent that has the default role and I will 
> still have the info in my TreeMap because the scheduler instance didn’t stop.

If you start multiple Scheduler drivers, they can all share a TreeMap because 
they are running in the same Java process.
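
To make the sharing concrete, here is a minimal, language-neutral sketch written in Python rather than Java. SketchScheduler is a hypothetical stand-in for a scheduler plus its driver, not a Mesos class, and the role names are just the ones from this thread. The only point is that two scheduler objects in one process can hold a reference to the same map.

```python
from dataclasses import dataclass


@dataclass
class SketchScheduler:
    """Hypothetical stand-in for one scheduler registered under one role."""
    role: str
    launched: dict  # shared map, passed in by the caller

    def on_task_launched(self, task_id: str, info: str) -> None:
        # Every scheduler holding the same map object sees this entry.
        self.launched[task_id] = (self.role, info)


# One map, created once and handed to both schedulers.
shared_tasks: dict = {}
reserved = SketchScheduler(role="SB", launched=shared_tasks)
default = SketchScheduler(role="*", launched=shared_tasks)

reserved.on_task_launched("task-1", "ran on the SB agent")
default.on_task_launched("task-2", "ran on a default-role agent")

print(sorted(shared_tasks))  # ['task-1', 'task-2']
```

In Java the shared object would be the TreeMap itself, with whatever synchronization is needed if the driver callbacks can run on different threads.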

> That’s all. My problem is that I don’t know how to change the role of the 
> framework without losing that TreeMap, and also how to set it with version 
> 1.3.0.  
> 
> I hope that everybody understands now.
> Thank you, and I am really sorry for the spam
> 
> 
>> On 5 Jul 2017, at 12:24, James Peach <jor...@gmail.com> wrote:
>> 
>> 
>>> On Jul 5, 2017, at 12:54 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>> 
>>> Hi, 
>>> 
>>> No, I would like my framework to be offered resources from an agent with a 
>>> role (e.g. thz) and, after running the specific tasks, change its role to 
>>> (*) in order to get offers from different agents. It will run the same 
>>> tasks because I never terminate the scheduler driver (that’s what I want).
>> 
>> As I suggested on Slack, I still think the most obvious way to implement 
>> this is to connect 2 frameworks, 1 for each role. Just co-ordinate 
>> internally to accept the offers you want in the right sequence. From your 
>> description, there's no requirement for this to be done in 1 framework.
>> 
>> I don't really follow what you mean by "run the same tasks". You can run new 
>> instances of the same task (from whatever framework you like); you can also 
>> send new tasks to an existing executor (from the same framework).
>> 
>>> Don’t try to find logic, it’s not for a company or something :)
>> 
>> I think that for people to help you, they need to be able to understand the 
>> logic of what you want to achieve and why :)
>> 
>>> 
>>> Thank you,
>>> Thodoris
>>> 
>>>> On 5 Jul 2017, at 05:36, Jay Guo <guojiannan1...@gmail.com> wrote:
>>>> 
>>>> Hi Thodoris,
>>>> 
>>>> If I understand correctly, you would like your framework to receive offers 
>>>> from both 'role' and '*', so resources reserved to 'role' on particular 
>>>> agent could be reliably supplied to the framework? Isn't it sufficient to 
>>>> start your framework with multiple roles, 'role' & '*'? You need to enable 
>>>> the capability.
>>>> 
>>>> - J
>>>> 
>>>> On Wed, Jul 5, 2017 at 7:28 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>>> I have built a framework in Java that runs certain tasks. I would like 
>>>> to run those tasks on a specific agent. I have set a role for the 
>>>> framework and used flags when starting the agent. Up to here everything 
>>>> is good. When the framework has run the tasks successfully I do not 
>>>> terminate it. I would like to change its role to the default (*) and get 
>>>> offered resources from the master that correspond to that role, and it 
>>>> will again run the same number of tasks (and the same tasks) because I 
>>>> never terminated it (and I don't want to terminate its instance because I 
>>>> keep some mesos metrics to

Re: Framework change role

2017-07-05 Thread James Peach

> On Jul 5, 2017, at 12:54 AM, Thodoris Zois  wrote:
> 
> Hi, 
> 
> No, I would like my framework to be offered resources from an agent with a 
> role (e.g. thz) and, after running the specific tasks, change its role to (*) 
> in order to get offers from different agents. It will run the same tasks 
> because I never terminate the scheduler driver (that’s what I want).

As I suggested on Slack, I still think the most obvious way to implement this 
is to connect 2 frameworks, 1 for each role. Just co-ordinate internally to 
accept the offers you want in the right sequence. From your description, 
there's no requirement for this to be done in 1 framework.

I don't really follow what you mean by "run the same tasks". You can run new 
instances of the same task (from whatever framework you like); you can also 
send new tasks to an existing executor (from the same framework).

> Don’t try to find logic, it’s not for a company or something :)

I think that for people to help you, they need to be able to understand the 
logic of what you want to achieve and why :)

> 
> Thank you,
> Thodoris
> 
>> On 5 Jul 2017, at 05:36, Jay Guo  wrote:
>> 
>> Hi Thodoris,
>> 
>> If I understand correctly, you would like your framework to receive offers 
>> from both 'role' and '*', so resources reserved to 'role' on particular 
>> agent could be reliably supplied to the framework? Isn't it sufficient to 
>> start your framework with multiple roles, 'role' & '*'? You need to enable 
>> the capability.
>> 
>> - J
>> 
>> On Wed, Jul 5, 2017 at 7:28 AM, Thodoris Zois  wrote:
>> I have built a framework in Java that runs certain tasks. I would like to 
>> run those tasks on a specific agent. I have set a role for the framework 
>> and used flags when starting the agent. Up to here everything is good. 
>> When the framework has run the tasks successfully I do not terminate it. I 
>> would like to change its role to the default (*) and get offered resources 
>> from the master that correspond to that role, and it will again run the 
>> same number of tasks (and the same tasks) because I never terminated it 
>> (and I don't want to terminate its instance because I keep some Mesos 
>> metrics in a static TreeMap). That's all. I just wanted somebody to explain 
>> to me exactly how it works and what I have to do, because everything I have 
>> tried today fails and I can't seem to find useful info on the Internet. 
>> 
>> Thank you!
>> 
>> On 4 Jul 2017, at 21:00, Michael Park  wrote:
>> 
>>> What is it that you need help with?
>>> 
>>> On Tue, Jul 4, 2017 at 11:12 AM Thodoris Zois  wrote:
>>> Hello list,
>>> 
>>> Is anybody available to help me with the new feature of 1.3.0 version, that 
>>> a framework can modify its role?
>>> 
>>> Thank you
>> 
> 



Re: ensuring a particular task is deployed to "all" Mesos Worker hosts

2017-07-01 Thread James Peach

> On Jul 1, 2017, at 11:14 AM, Erik Weathers  wrote:
> 
> Thanks for the info Kevin.  Seems there's no JIRAs nor design docs floating 
> about yet for "admin tasks" or "daemon sets".
> 
> Just FYI, this is the ticket in Storm for the problem I've been mentioning:
> 
> https://issues.apache.org/jira/browse/STORM-1342
> 
> I'll update it with the info you've provided below, so for now we'll rely on 
> manually deploying logviewers.

ISTM that this should be possible with a smart framework. If the framework 
keeps track of which agents it gets offers for, it could ensure that it 
launches a Storm logviewer task on the agent before launching any other Storm 
tasks. I expect that it might be a little tricky to get the containerization 
right so that the Storm tasks can rendezvous with the logviewer, but in 
principle it could be made to work?
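
As a rough sketch of that bookkeeping (plain Python, no Mesos bindings; plan_launches, the task names and the agent IDs are all made up for illustration), the framework could decide per offer whether a logviewer has to go first:

```python
def plan_launches(agent_id, pending_tasks, agents_with_logviewer):
    """Return the task names to launch on one agent's offer, logviewer first.

    agents_with_logviewer is a set the framework keeps across offers. It is
    updated in place when a logviewer is scheduled on a new agent.
    """
    if not pending_tasks:
        return []  # nothing to run here, so no logviewer is needed either
    launches = []
    if agent_id not in agents_with_logviewer:
        launches.append("logviewer")
        agents_with_logviewer.add(agent_id)
    launches.extend(pending_tasks)
    return launches


seen = set()
print(plan_launches("agent-1", ["worker-a"], seen))  # ['logviewer', 'worker-a']
print(plan_launches("agent-1", ["worker-b"], seen))  # ['worker-b']
```

A real framework would probably only record the agent once the logviewer task is reported as running, and would have to forget the agent again if that task is lost. The sketch skips both.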

> 
> Thanks!
> 
> - Erik
> 
> On Sat, Jul 1, 2017 at 10:09 AM Kevin Klues  wrote:
> What you are describing is a feature we call 'admin tasks' or 'daemon sets'. 
> 
> Unfortunately, there is no direct support for these yet, but we do have plans 
> in the (relatively) near future to start working on it.
> 
> One of our use cases is exactly what you describe with the logging service. 
> On DC/OS we currently run our logging service as a systemd unit outside of 
> mesos since we can't guarantee it gets launched everywhere (the same is true 
> for a bunch of other services as well, namely metrics).
> 
> We don't have an exact timeline for when we will build this support yet, but 
> we will certainly announce it once we start actively working on it.
> 
> 
> On Sat, Jul 1, 2017 at 09:45, Erik Weathers wrote:
> That works for our particular use case, and is effectively what *we* do, but 
> renders storm a "strange bird" amongst mesos frameworks.  Is there no 
> trickery that could be played with mesos roles and/or reservations?
> 
> - Erik
> 
> On Sat, Jul 1, 2017 at 3:57 AM Dick Davies  wrote:
> If it _needs_ to be there always then I'd roll it out with whatever
> automation you use to deploy the mesos workers ; depending on
> the scale you're running at launching it as a task is likely to be less
> reliable due to outages etc.
> 
> ( I understand the 'maybe all hosts' constraint but if it's 'up to one per
> host', it sounds like a CM issue to me. )
> 
> On 30 June 2017 at 23:58, Erik Weathers  wrote:
> > hi Mesos folks!
> >
> > My team is largely responsible for maintaining the Storm-on-Mesos framework.
> > It suffers from a problem related to log retrieval:  Storm has a process
> > called the "logviewer" that is assumed to exist on every host, and the Storm
> > UI provides links to contact this process to download logs (and other
> > debugging artifacts).   Our team manually runs this process on each Mesos
> > host, but it would be nice to launch it automatically onto any Mesos host
> > where Storm work gets allocated. [0]
> >
> > I have read that Mesos has added support for Kubernetes-esque "pods" as of
> > version 1.1.0, but that feature seems somewhat insufficient for implementing
> > our desired behavior from my naive understanding.  Specifically, Storm only
> > has support for connecting to 1 logviewer per host, so unless pods can have
> > separate containers inside each pod [1], and also dynamically change the set
> > of executors and tasks inside of the pod [2], then I don't see how we'd be
> > able to use them.
> >
> > Is there any existing feature in Mesos that might help us accomplish our
> > goal?  Or any upcoming features?
> >
> > Thanks!!
> >
> > - Erik
> >
> > [0] Thus the "all" in quotes in the subject of this email, because it
> > *might* be all hosts, but it definitely would be all hosts where Storm gets
> > work assigned.
> >
> > [1] The Storm-on-Mesos framework leverages separate containers for each
> > topology's Supervisor and Worker processes, to provide isolation between
> > topologies.
> >
> > [2] The assignment of Storm Supervisors (a Mesos Executor) + Storm Workers
> > (a Mesos Task) onto hosts is ever changing in a given instance of a
> > Storm-on-Mesos framework.  i.e., as topologies get launched and die, or have
> > their worker processes die, the processes are dynamically distributed to the
> > various Mesos Worker hosts.  So existing containers often have more tasks
> > assigned into them (thus growing their footprint) or removed from them (thus
> > shrinking the footprint).
> -- 
> ~Kevin



Re: Mesos-Metrics per task

2017-06-29 Thread James Peach

> On Jun 29, 2017, at 3:53 PM, Thodoris Zois  wrote:
> 
> Hello, i would like to get some metrics per task. E.g memory/cpu usage is 
> there any way? 
> 
> Thank you! 

You can use the GET_CONTAINERS agent API call to get resource usage for a 
container, then match up the container to a task by using other master and 
agent APIs to match the framework ID and executor ID.
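
The matching step could look roughly like this in Python. Note the assumptions: the dictionaries below are flattened, hypothetical records, not the actual response format (the real API responses nest the IDs and use their own field names), and usage_by_task is a made-up helper name.

```python
def usage_by_task(containers, tasks):
    """Join container statistics to tasks on (framework ID, executor ID)."""
    stats = {
        (c["framework_id"], c["executor_id"]): c["statistics"]
        for c in containers
    }
    return {
        t["task_id"]: stats.get((t["framework_id"], t["executor_id"]))
        for t in tasks
    }


# Flattened example records, for illustration only.
containers = [
    {"framework_id": "fw-1", "executor_id": "exec-1",
     "statistics": {"mem_rss_bytes": 1048576}},
]
tasks = [
    {"task_id": "task-1", "framework_id": "fw-1", "executor_id": "exec-1"},
    {"task_id": "task-2", "framework_id": "fw-1", "executor_id": "exec-9"},
]
print(usage_by_task(containers, tasks))
```

Tasks with no matching container map to None. If several tasks share one executor they all map to the same container statistics, so the numbers are per container rather than strictly per task.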

J

Re: Agent Working Directory Best Practices

2017-06-26 Thread James Peach

> On Jun 26, 2017, at 4:05 PM, Steven Schlansker  
> wrote:
> 
> 
>> On Jun 25, 2017, at 11:24 PM, Benjamin Mahler  wrote:
>> 
>> As a data point, as far as I'm aware, most users are using a local work 
>> directory, not an NFS mounted one. Would love to hear from anyone on the 
>> list if they are doing this, and if there are any subtleties that should be 
>> documented.
> 
> We don't run NFS in particular but we did originally use a SAN -- two 
> observations:
> 
> NFS (historically, maybe it's better now, but doubtful...) has really bad 
> failure modes. Network failures can cause serious hangs both in user-space 
> and kernel-space. Such hangs can be impossible to clear without rebooting 
> the machine, and in some edge cases can even make it difficult or impossible 
> to reboot the machine via normal means.

You need to make sure to mount with the "intr" option.

https://speakerdeck.com/gnb/130-lca2008-nfs-tuning-secrets-d7

> 
> Network attached drives (our SAN) are less reliable, slower, and more complex
> (read: more failure modes) than local disk.  It's also a really big single
> point of failure.  So far our only true cluster outages have been due to
> failure of the SAN, since it took down all nodes at once -- once we removed
> the SAN, future failures had islands of availability and any properly written
> application could continue running (obviously without network resources)
> through the incident.
> 
> Maybe this isn't a huge deal for your use case, which might differ from ours.
> For us, it was enough of a problem that we now purchase local SSD scratch
> space for every node just so that we have some storage we can depend on a bit
> more than network attached storage.
> 
>> 
>> On Thu, Jun 22, 2017 at 11:13 PM,  wrote:
>> Hi,
>> 
>> We have a couple of server nodes mainly used for computational tasks in
>> our mesos cluster. These servers have beefy cpus, gpus etc. but only
>> limited ssd space. We also have a 40GBe network and a decently fast
>> file server.
>> 
>> My question is simple but I didn't find an answer anywhere: What are the
>> best practices for the working directory on mesos-agent nodes? Should
>> we keep the working directory local or is it reasonable to use an NFS
>> mounted folder? We implemented both and they seem to work fine, but I
>> would rather follow "best practices".
>> 
>> Thanks and cheers
>> 
>> Tom
>> 
> 



Re: Work group on Community

2017-06-16 Thread James Peach

> On Jun 15, 2017, at 10:57 AM, Vinod Kone  wrote:
> 
> Hi folks,
> 
> Seeing that our first official containerizer WG is off to a good start, we
> want to use that momentum to start new WGs.
> 
> I'm proposing that we start a new work group on community. The mission of
> this work group would be to figure out ways to grow the size of our
> community and improve the experience of community members (users, devs,
> contributors, committers etc).
> 
> In the first meeting, we can nail down what the charter of this work group
> should be etc. My initial ideas for the topics/components this work group
> could cover
> 
> --> Releases
> --> Roadmap
> --> Reviews
> --> JIRA
> --> CI
> 
> Over time, I'm hoping that new specific work groups will spring up that can
> own some of these topics.
> 
> If you are interested in joining this work group, please reply to this
> thread and I'll add you to the invite.

I'm interested, but unlikely to have much bandwidth to contribute anything 
substantial. One suggestion I have is that a Mesos Weekly news would be pretty 
great. There is a lot of activity on reviewboard, slack and in design documents 
and collecting that in a regular newsletter would give that activity a lot more 
visibility.

J

Re: How to filter GET_TASKS api result

2017-04-19 Thread James Peach

> On Apr 19, 2017, at 5:00 PM, Benjamin Mahler  wrote:
> 
> We can add a Call.GetTasks message to allow you to specify which task ids you 
> would like to retrieve. But this isn't supported yet, the code needs to be 
> written. E.g.
> 
> message Call {
>   enum Type {
>     GET_TASKS = 13; // Retrieves the information about tasks, see `GetTasks` below.
>   }
> 
>   message GetTasks {
>     // Which tasks to retrieve, leave empty to retrieve all tasks.
>     repeated TaskID task_ids;
>   }
> }

See also https://issues.apache.org/jira/browse/MESOS-6935. It makes sense to be 
able to ask for specific FrameworkIDs too.
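
Until something like that exists, the filtering has to happen on the client. A minimal sketch in Python, assuming the tasks have already been fetched and flattened into plain dictionaries (the field names and filter_tasks are illustrative, not the API's own format):

```python
def filter_tasks(tasks, task_ids=(), framework_ids=()):
    """Keep tasks matching the given IDs. An empty filter matches everything."""
    wanted_tasks = set(task_ids)
    wanted_frameworks = set(framework_ids)
    return [
        t for t in tasks
        if (not wanted_tasks or t["task_id"] in wanted_tasks)
        and (not wanted_frameworks or t["framework_id"] in wanted_frameworks)
    ]


tasks = [
    {"task_id": "t1", "framework_id": "spark"},
    {"task_id": "t2", "framework_id": "chronos"},
    {"task_id": "t3", "framework_id": "spark"},
]
print(filter_tasks(tasks, framework_ids=["spark"]))
print(filter_tasks(tasks, task_ids=["t2"]))
```

This does nothing about the cost of fetching every task in the first place, which is the complaint quoted below. That part needs the server-side filter.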

> 
> On Thu, Apr 6, 2017 at 8:31 PM, 梦开始的地方 <382607...@qq.com> wrote:
> 
> But Spark and Chronos have too many short tasks, so getting all tasks is too slow.
> 
> -- Original Message --
> From: "Alexander Rojas";;
> Sent: Monday, April 3, 2017, 9:47 PM
> To: "user";
> Subject: Re: How to filter GET_TASKS api result
> 
> Hi,
> 
> Mesos does not have a way to get info about a single task, however the answer 
> should be pretty easy to filter so you can search for the task you’re looking 
> for.
> 
> Alexander Rojas
> alexan...@mesosphere.io
> 
> 
> 
> 
>> On 20 Mar 2017, at 10:35, 梦开始的地方 <382607...@qq.com> wrote:
>> 
>> Hi, I'd like to use the GET_TASKS API to get a specific task, but the API 
>> returns all tasks.
>> Please help me, thanks.
>> 
> 
> 



Re: Structured logging for Mesos (or c++ glog)

2016-12-19 Thread James Peach

> On Dec 19, 2016, at 2:54 PM, Zhitao Li  wrote:
> 
> Hi James,
> 
> Stitching events together is only one possible use case, and I'm not exactly 
> sure what you meant by direct event logging.
> 
> Take the hierarchical allocator for example. In a multi-framework cluster, 
> sometimes I want to comb through various logs and present a trace of how 
> allocation has affected a particular framework (by its framework id) and/or 
> a particular agent (by its agent id).
> 
> Being able to systematically and automatically extract structured field 
> values like framework_id or agent_id from all logs, regardless of the actual 
> logging pattern, will be tremendously valuable in such use cases.

I think we are talking about similar things. Many servers do both free-form 
error logging and structured event logging. I'm thinking of event logging 
formats that are customizable by the operator and allow the interpolation of 
context-specific data items (eg. HTTP access logs from many different server 
implementations).
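
A toy version of that idea in Python, purely as an illustration: the format syntax, the field names and render_event are assumptions for this sketch, not anything Mesos or glog provides.

```python
import string


def render_event(log_format, **context):
    """Fill an operator-supplied format with context-specific fields."""
    # safe_substitute leaves unknown placeholders in place instead of raising.
    return string.Template(log_format).safe_substitute(context)


# The operator chooses the layout, the server supplies the fields.
fmt = "$event framework=$framework_id agent=$agent_id"
line = render_event(fmt, event="offer_sent",
                    framework_id="fw-1", agent_id="agent-7")
print(line)  # offer_sent framework=fw-1 agent=agent-7
```

Because every line comes from one format with named fields, pulling out framework_id or agent_id later does not depend on the wording of individual log statements.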

J

Re: Persistent volume ownership issue

2016-06-21 Thread James Peach
On 21 June 2016 at 12:25, Jie Yu <yujie@gmail.com> wrote:
> James, sticky bit means that there will be no write sharing between two
> users even if the underlying permission allows it. I'd prefer not having
> this restriction:)

No, it just prevents users renaming or deleting each other's files.

http://man7.org/linux/man-pages/man1/chmod.1.html

If you want multiple users to be able to write to the same files, they
need to create with the right ownership.
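
For reference, a small Python sketch of what "world writeable with the sticky bit set" means in mode bits. It uses a throwaway temporary directory as a stand-in for the volume root, so it shows the permission mechanics only and says nothing about how Mesos would set this up.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as volume_root:
    # 0o1777 = rwx for everyone plus the sticky bit, the same mode as /tmp.
    os.chmod(volume_root, 0o1777)
    mode = os.stat(volume_root).st_mode
    print(oct(stat.S_IMODE(mode)))     # permission bits incl. sticky bit
    print(bool(mode & stat.S_ISVTX))   # whether the sticky bit is set
```

With this mode any user can create files in the directory, but only a file's owner (or the directory owner, or root) can rename or delete it.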

>> I wonder whether ACLs are the right solution to volume ownership?
>> Certainly I think inherited ACLs are a good solution for expressing a
>> consistent access control policy over a hierarchy (at least in the
>> Windows/Darwin/SMB/NFSv4/RichAcl ACL model).
>
>
> Are you suggesting that we don't expose the underlying unix user directly to
> frameworks. Instead, expressing permissions and ownerships using ACLs?

Well that could be an option, though I'm mainly thinking out loud.
With shared volumes, it seems like you really want an access control
policy that applies to the volume, rather than requiring processes to
collaborate at a file granularity. One way to do that would be to make
the owner the creator of the volume, then use ACL inheritance to grant
additional access to other users. You'd have to reflow the
inheritance, but it could probably be done.

-- 
James Peach | jor...@gmail.com


Re: Persistent volume ownership issue

2016-06-21 Thread James Peach
Non-recursive chown is an improvement over recursive chown which seems
fraught and should be avoided. For an interim fix, could you make the
volume root world writeable with the sticky bit set? Then you wouldn't
have to chown and volume users would still be able to create files.

I wonder whether ACLs are the right solution to volume ownership?
Certainly I think inherited ACLs are a good solution for expressing a
consistent access control policy over a hierarchy (at least in the
Windows/Darwin/SMB/NFSv4/RichAcl ACL model).

On 20 June 2016 at 23:25, Jie Yu <yujie@gmail.com> wrote:
> Hi folks,
>
> Currently, the ownership of the persistent volumes are set to be the same as
> the sandbox. In the implementation, we call `chown -R` on the persistent
> volume to match that of the sandbox each time before we mount it into the
> container.
>
> Recently, we realized that this behavior is not ideal, especially if a task
> created some files in the persistent volume and the owner of those files
> is different from the task's user. For instance, a task is running
> under root and it creates some database files under user 'database' and
> launches the database process under user 'database'. When the database
> process is restarted by the scheduler, the current behavior is that we'll do
> a 'chown -R root.root' on the persistent volume, which causes the database
> files to be chowned to 'root'.
>
> The true fix for this problem is to allow frameworks to explicitly specify
> the owner of persistent volumes during creation. This is captured in this
> ticket:
> https://issues.apache.org/jira/browse/MESOS-4893
>
> In the short-term (for 1.0), I propose that, instead of doing a recursive
> chown, we do a non-recursive chown. That'll allow the new task to at least
> create new files under the persistent volume, but do not change ownership of
> files created by previous tasks. It should be a very simple fix which we can
> ship in 1.0. We'll ship MESOS-4893 after 1.0. What do you guys think?
>
> Thanks,
> - Jie



-- 
James Peach | jor...@gmail.com


Re: How is the OS X environment created with Mesos

2016-05-18 Thread James Peach
This probably boils down to not being in the right launchd session.
launchd(8) discusses this at a high level. You can see what is going
on in your user session with "launchctl print user/$(id -u)".

I'm not sure what the right mechanics ought to be for Mesos. It used
to be that you would use the "bsexec" subcommand to run something in a
different session, but that is deprecated and I don't see an obvious
replacement in the new subcommands. Maybe worth asking on the
launchd-dev mailing list ...


On 11 May 2016 at 12:10, DiGiorgio, Mr. Rinaldo S. <rdigior...@pace.edu> wrote:
>
> On May 5, 2016, at 13:28, haosdent <haosd...@gmail.com> wrote:
>
>>There is no explicit statement about what Mesos means when it runs a task
>> as some other user.
> I think this just ensures that the task runs as the user you
> specified. In Mesos, it just calls [setuid](http://linux.die.net/man/2/setuid)
> to change the user. It would not execute something like the user's bashrc
> script.
>
>
> I have been unable to solve this problem for the last few days. I am
> wondering if you have any ideas.
>
>
>
> When Mesos starts a task on an OSX machine, the task is run with setuid to
> the user I have asked for.  When that user runs I cannot get that user to
> have a default login keychain.  I want to initialize the environment so that
> user has something that looks like this.
>
>  existinguser$ security login-keychain
>
>
>  "/Users/rinaldo/Library/Keychains/login.keychain"
>
>
> I have tried many options to create the above keychain for the other user
> that is running in a process that was created by mesos and changed to that
> user with setuid.
>
> I understand that is likely not a Mesos issue. I am hoping someone on this
> alias has come across this issue or something similar.  I have tried the
> following and they have all failed.
>
> su -c   as existinguser
>
> /bin/login as existinguser
>
> OSX is not Open Source so it is difficult to understand what it is they do
> to create a user environment.  The "security" application has many options
> to create keychains but when I use those options the keychains end up in
>
>
> "/Library/Keychains/System.keychain"
>
>    "/Library/Keychains/System.keychain"
>
>
>   I have not investigated how a user is able to create a keychain in the
> System.keychain when running as a user in a Mesos created process.
>
>
> Rinaldo
>
>
>
>
>
> On Thu, May 5, 2016 at 7:41 PM, DiGiorgio, Mr. Rinaldo S.
> <rdigior...@pace.edu> wrote:
>>
>> Hi,
>>
>> Recently I noticed that the Mesos Jenkins plugin supports the
>> setting of environment variables. Somewhere between 0.26 and 0.28.1,
>> settings like
>>
>> USER=
>> HOME=
>>
>> were required to get things to work the way they had worked. I
>> have been able to set the environment this way but I have some concerns
>> about it.
>>
>> There is no explicit statement about what Mesos means when it runs
>> a task as some other user.  Clearly it is not running some of the scripts
>> normally run during login.  This was a constant source of confusion with
>> Jenkins. If one can state what exactly is done to create the user
>> environment each platform and how it is different that others it will save
>> countless hours of debugging IMO. I realize OSX is an odd system -- linux at
>> times, Apple specific at times in areas that conflict with Linux but this
>> will only get more complicated when Windows agents become available.
>>
>>
>>
>> Rinaldo
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>



-- 
James Peach | jor...@gmail.com


Re: [Proposal] Remove the default value for agent work_dir

2016-04-12 Thread James Peach

> On Apr 12, 2016, at 3:58 PM, Greg Mann  wrote:
> 
> Hey folks!
> A number of situations have arisen in which the default value of the Mesos 
> agent `--work_dir` flag (/tmp/mesos) has caused problems on systems in which 
> the automatic cleanup of '/tmp' deletes agent metadata. To resolve this, we 
> would like to eliminate the default value of the agent `--work_dir` flag. You 
> can find the relevant JIRA here.
> 
> We considered simply changing the default value to a more appropriate 
> location, but decided against this because the expected filesystem structure 
> varies from platform to platform, and because it isn't guaranteed that the 
> Mesos agent would have access to the default path on a particular platform.
> 
> Eliminating the default `--work_dir` value means that the agent would exit 
> immediately if the flag is not provided, whereas currently it launches 
> successfully in this case. This will break existing infrastructure which 
> relies on launching the Mesos agent without specifying the work directory. I 
> believe this is an acceptable change because '/tmp/mesos' is not a suitable 
> location for the agent work directory except for short-term local testing, 
> and any production scenario that is currently using this location should be 
> altered immediately.

+1 from me too. Defaulting to /tmp just helps people shoot themselves in the 
foot.

J

Re: verbose logging with the docker executor

2016-03-19 Thread James Peach

> On Mar 17, 2016, at 10:09 AM, Clarke, Trevor  wrote:
> 
> Looking in the docker executor, the docker command line is logged with 
> VLOG(1) but I'm not sure how to generate that level of log output. Some 
> googling suggests it's used in the google logging library and verbose logging 
> would be enabled with something like --v=1 but that's not a valid mesos-slave 
> option. Can someone point me in the right direction? (currently using 0.24.1)

You can set the GLOG_v environment variable (see 
https://google-glog.googlecode.com/svn/trunk/doc/glog.html#verbose) to the 
desired verbosity level and then restart mesos-slave. If you just want to 
increase the log level without a restart, you can hit the /logging/toggle 
endpoint on the mesos-slave (do curl http://127.0.0.1:5051/help/logging/toggle 
for the online help).
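
If you script this, the two options could be wrapped roughly as follows in Python. The helper names are made up, and the "level" and "duration" query parameter names are my assumption about the endpoint, so check them against the /help/logging/toggle output on your version before relying on them.

```python
import os
from urllib.parse import urlencode


def verbose_env(level, base_env=None):
    """Copy an environment and set GLOG_v for a process started with it."""
    env = dict(os.environ if base_env is None else base_env)
    env["GLOG_v"] = str(level)
    return env


def toggle_url(host, level, duration, port=5051):
    """Build a /logging/toggle URL (parameter names are an assumption)."""
    query = urlencode({"level": level, "duration": duration})
    return f"http://{host}:{port}/logging/toggle?{query}"


print(verbose_env(1, base_env={})["GLOG_v"])  # 1
print(toggle_url("127.0.0.1", 1, "10mins"))
```

The environment from verbose_env would be passed to whatever starts mesos-slave (for example via subprocess.Popen(..., env=...)), while the URL is for changing the level on a running agent.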

J

Re: OS X build

2015-09-27 Thread James Peach

> On Sep 27, 2015, at 4:15 PM, Vaibhav Khanduja <vaibhavkhand...@gmail.com> 
> wrote:
> 
> Probably yes, 
> 
> The issue which I am pointing out is with the configure script not accepting 
> option "—with-arp"

OK then I'm confused because there is a --with-apr option, and it works AFAICT

jpeach$ ./configure --help | grep apr
  --with-apr=[=DIR]   specify where to locate the apr-1 library

> 
> On Sat, Sep 26, 2015 at 9:26 PM, James Peach <jor...@gmail.com> wrote:
> 
> > On Sep 26, 2015, at 12:01 PM, Vaibhav Khanduja <vaibhavkhand...@gmail.com> 
> > wrote:
> >
> > I am running into issues with the build on my Mac (OS X): the configure 
> > script complains about libapr-1 not being present. I was able to find a 
> > workaround by passing the --with-apr option to configure. It looks like 
> > the script checks that the variable is a valid shell variable, and 
> > rejects it if not.
> >
> > I was able to work around it by adding quotes and dropping a "-" from the 
> > option
> >
> >  ../configure -disable-python --disable-java 
> > "-with-apr=/usr/local/Cellar/apr/1.5.2/libexec/"
> >
> > configure --help, though, suggests using --with-apr
> 
> AFAICT you are supposed to use apr-1-config to fish out the libapr path when 
> using Homebrew. I think that it would be reasonable for the Mesos build to 
> just automatically use apr-1-config if it is present.
> 
> ./configure --with-apr=$(apr-1-config --prefix)
> 
> >
> >
> > am I missing something here?
> >
> > Thx
> 
> 



Re: OS X build

2015-09-26 Thread James Peach

> On Sep 26, 2015, at 12:01 PM, Vaibhav Khanduja  
> wrote:
> 
> I am running into issues with the build on my Mac (OS X): the configure 
> script complains about libapr-1 not being present. I was able to find a 
> workaround by passing the --with-apr option to configure. It looks like the 
> script checks that the variable is a valid shell variable, and rejects it 
> if not.
> 
> I was able to work around it by adding quotes and dropping a "-" from the 
> option
> 
>  ../configure -disable-python --disable-java 
> "-with-apr=/usr/local/Cellar/apr/1.5.2/libexec/"
> 
> configure --help, though, suggests using --with-apr

AFAICT you are supposed to use apr-1-config to fish out the libapr path when 
using Homebrew. I think that it would be reasonable for the Mesos build to just 
automatically use apr-1-config if it is present.

./configure --with-apr=$(apr-1-config --prefix)

>  
> 
> am I missing something here?
> 
> Thx



Re: Building portable binaries

2015-09-17 Thread James Peach

> On Sep 17, 2015, at 4:33 PM, F21  wrote:
> 
> Is there any way to build portable binaries for mesos?
> 
> Currently, I have tried building my own libsvn, libsasl2, libcurl, libapr and 
> then built mesos using the following:
> 
> ../configure CC=gcc-4.8 CXX=g++-4.8 
> LD_LIBRARY_PATH=/tmp/mesos-build/sasl2/lib 
> SASL_PATH=/tmp/mesos-build/sasl2/lib/sasl2 --prefix=/tmp/mesos-build/mesos 
> --with-svn=/tmp/mesos-build/svn --with-apr=/tmp/mesos-build/apr 
> --with-sasl=/tmp/mesos-build/sasl2/ --with-curl=/tmp/mesos-build/curl
> make
> make install
> 
> I then compress /tmp/mesos-build/mesos into an archive and distribute it to 
> my machines. The only problem is that the build seems to be buggy. For 
> example, I've been experiencing containerization issues where the executors 
> crash without writing anything useful to stderr or stdout. See 
> https://github.com/mesosphere/hdfs/issues/194
> 
> Is there a definite way to build portable binaries that I can easily copy to 
> another machine to run?

You could do a statically linked build by doing configure --enable-static 
--disable-shared. I don't know whether that is supported in the Mesos build, 
but it is a standard automake feature, so if it fails it should be fixable.

J



Re: Recommended way to discover current master

2015-08-31 Thread James Peach

> On Aug 31, 2015, at 10:25 AM, Philip Weaver  wrote:
> 
> My framework knows the list of zookeeper hosts and the list of mesos master 
> hosts.
> 
> I can think of a few ways for the framework to figure out which host is the 
> current master. What would be the best? Should I check in zookeeper directly? 
> Does the mesos library expose an interface to discover the master from 
> zookeeper or otherwise? Should I just try each possible master until one 
> responds?

If you want to do it the HTTP way, just hit the /master/redirect endpoint on 
any master that you can reach.
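
Something like the sketch below, assuming curl is available and that the 
endpoint answers with an HTTP redirect whose Location header points at the 
leader. The helper names location_from_headers and leading_master are made up 
for illustration, and the host:port argument is whatever master you can reach.

```shell
# Hypothetical helper: print the Location header value from HTTP response
# headers read on stdin (carriage returns are stripped first).
location_from_headers() {
    tr -d '\r' | awk 'tolower($1) == "location:" { print $2; exit }'
}

# Hypothetical helper: ask one reachable master where the leader is.
# $1 is host:port of any master. The redirect is not followed; only the
# response headers are inspected.
leading_master() {
    curl -sS -o /dev/null -D - "http://$1/master/redirect" | location_from_headers
}
```

Calling leading_master with each known master in turn until one answers gives 
you the current leader without talking to ZooKeeper directly.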

> 
> Apologies if this is already well documented, but I wasn't able to find it. 
> Thanks!
> 
> - Philip
> 



Re: Build 0.23 gcc Version

2015-07-29 Thread James Peach

> On Jul 28, 2015, at 6:45 AM, John Omernik j...@omernik.com wrote:
> 
> So, I don't mean to sound like a newbie here, but in running my current setup 
> which has 4.6.3, (and I tried to run 4.8) how can I get Mesos 0.23 to 
> compile. Is this something I need to change in certain files? In certain 
> steps? Is this something that should be a bug in Mesos to handle the 
> versions? Is this a configuration issue? I'd love to learn more about how 
> this works, but would love some pointers here, and since my setup is fairly 
> vanilla, others may also benefit from getting this to work.

AFAIK Mesos requires gcc >= 4.8. You can force a specific compiler by passing 
the CC and CXX variables to configure, e.g. ./configure CC=gcc-4.8 CXX=g++-4.8. 
In your previous message, it looked like configure was using cached values for 
the compiler check. If it still does that, try removing config.cache.
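
Put together, that is roughly the following. This is a sketch in POSIX shell; 
reconfigure_with is a made-up helper name, and it assumes you run it from the 
build directory where config.cache would live.

```shell
# Hypothetical helper: drop cached configure results, then re-run configure
# with an explicit compiler pair.
# $1 = C compiler, $2 = C++ compiler, $3 = configure script (default ./configure).
reconfigure_with() {
    cc=$1
    cxx=$2
    configure_script=${3:-./configure}
    # Remove any cached compiler checks so the new CC/CXX are actually probed.
    rm -f config.cache
    "$configure_script" CC="$cc" CXX="$cxx"
}
```

For example, reconfigure_with gcc-4.8 g++-4.8 ../configure from an out-of-tree 
build directory.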

> 
> John
> 
> On Mon, Jul 27, 2015 at 10:56 AM, James Peach jor...@gmail.com wrote:
> 
> > > On Jul 24, 2015, at 3:57 PM, Michael Park mcyp...@gmail.com wrote:
> > > 
> > > Hi John,
> > > 
> > > I would first suggest trying CC=gcc CXX=g++ ../configure, and if that 
> > > works, try to find out what which cc and which c++ return and find out what 
> > > they symlink to.
> > > I believe autotools uses cc and c++ rather than gcc and g++ by default, so 
> > > I think there's probably something funky going on there.
> > 
> > No, you explicitly tell autoconf to default to G++
> > 
> > mesos.git jpeach$ grep AC_PROG_C configure.ac
> > AC_PROG_CXX([g++])
> > AC_PROG_CC([gcc])
> > 
> > IMHO the correct invocation is something like:
> > AC_PROG_CXX([c++ g++ clang++])
> > 
> > since you should always default to the system default toolchain
> > 
> > J
> 



Re: Build 0.23 gcc Version

2015-07-27 Thread James Peach

> On Jul 24, 2015, at 3:57 PM, Michael Park mcyp...@gmail.com wrote:
> 
> Hi John,
> 
> I would first suggest trying CC=gcc CXX=g++ ../configure, and if that 
> works, try to find out what which cc and which c++ return and find out what 
> they symlink to.
> I believe autotools uses cc and c++ rather than gcc and g++ by default, so I 
> think there's probably something funky going on there.

No, you explicitly tell autoconf to default to G++ 

mesos.git jpeach$ grep AC_PROG_C configure.ac
AC_PROG_CXX([g++])
AC_PROG_CC([gcc])

IMHO the correct invocation is something like:
AC_PROG_CXX([c++ g++ clang++])

since you should always default to the system default toolchain

J