Re: hostname in task

2019-08-03 Thread James Peach



> On Aug 3, 2019, at 10:59 PM, Marc Roos  wrote:
> 
> 
> I read you can add a hostname option to the container in this issue[0], 
> however I still have the uuid. Is this available in mesos 1.8?

Yep.

> Can I 
> read about all these options somewhere? Like here[1]

The Mesos API is defined in the ContainerInfo protobuf, but I’m not sure how 
Marathon maps that:

https://github.com/apache/mesos/blob/master/include/mesos/v1/mesos.proto#L3395


> 
> 
> [@ cni]# cat 2f261fa8-4985-4614-b712-f0785ca6ce04/hosts
> 127.0.0.1 localhost
> 192.168.123.32 2f261fa8-4985-4614-b712-f0785ca6ce04
> 
> [0]
> https://reviews.apache.org/r/55191/
> [1]
> http://mesosphere.github.io/marathon/api-console/index.html
> 
> Using mesos 1.8
> And
> 
> "container": {
>"type": "MESOS",
>"hostname": "test.example.com",
>"docker": {
>"image": "test",
>"credential": null,
>"forcePullImage": true
>},
>   "volumes": [
>  {
>  "mode": "RW",
>  "containerPath": "/dev/log",
>  "hostPath": "/dev/log" 
>  }
>  ]
>  },



Re: On adding a debug endpoint for Mesos containerizer

2019-06-05 Thread James Peach
I really like this proposal and I think that it would help operational teams a 
lot. Let’s make sure that it is well documented :)

> On Jun 5, 2019, at 1:05 AM, Andrei Budnik  wrote:
> 
> Hi folks,
> 
> We have been encountering container stuck issues for quite a long time. Some 
> of these issues are caused by external components such as CNI/CSI plugins, 
> custom Mesos modules, etc. There were also cases when a container became 
> stuck due to a Linux kernel bug. All of this makes container stuck issues 
> difficult to debug.
> 
> We are proposing a container debug endpoint for the Mesos agent [1], which is 
> based on a new mechanism for tracking pending libprocess futures [2].
> 
> Please review both of them.
> 
> [1] Container debug endpoint: 
> https://docs.google.com/document/d/1VtlKD6b8a22HzSdaJUeI7cPGuKd01vLwBJT4XfkeUDI
> [2] Tracking libprocess futures: 
> https://docs.google.com/document/d/1Unu2pe0dRq3Z6XQ5S8lWZm2cU2REjfkUj0xk2ePQ0MY



Re: ssl mesos-executor not using /etc/default/mesos

2019-02-18 Thread James Peach



> On Feb 16, 2019, at 9:46 AM, Marc Roos  wrote:
> 
> 
> 
> Looks like the mesos-executor is not using /etc/default/mesos 
> environment variables

Depending on your configuration, the executor runs inside the container, which 
means that /etc/default/mesos is probably not available. 

> 
> If I export the variables in /etc/default/mesos manually, I can run the 
> task. 
> 
> mesos-execute --master=x.x.x.x:5050 --principal=xxx --secret=xxx 
> --name=ls --command="ls -lRrt /*; sleep 60" --env=file:///test/env.json
> 
> How should this be resolved? I tried setting 
> --executor_environment_variables=/etc/mesos/executor-env.json 

And what happened when you did this?
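
For context, the agent's `--executor_environment_variables` flag (and 
`mesos-execute --env`) takes a JSON object mapping variable names to values, 
optionally loaded from a file with a `file://` prefix. A minimal sketch of 
such a file (the variables and paths are illustrative, not a recommended 
configuration):

  {
    "LIBPROCESS_SSL_ENABLED": "true",
    "LIBPROCESS_SSL_CERT_FILE": "/etc/ssl/mesos/cert.pem",
    "LIBPROCESS_SSL_KEY_FILE": "/etc/ssl/mesos/key.pem"
  }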

Re: Discussion: Scheduler API for Operation Reconciliation

2019-01-24 Thread James DeFelice
I've attempted to implement support for operation status reconciliation in
a framework that I've been building. Option (III) seems most convenient
from my perspective as well. A single source of updates:

(a) Leads to a cleaner framework design; I've had to poke a few holes in
the framework's initial design to deal with multiple event sources, leading
to increased complexity.

(b) Allows frameworks to consume events in the order they arrive (and
pushes the responsibility for event ordering back to Mesos). Multiple event
sources that the framework needs to (possibly) reorder based on a timestamp
would add further complexity that we should avoid pushing onto framework
writers.

Some other thoughts:

(c) I've implemented a background polling loop for exactly the reason that
Benno pointed out. An asynchronous API call for operation status
reconciliation would be fine with me.

(d) API consistency is important. Framework devs are used to the way that
the task status reconciliation API works, and have come up with solutions
for dealing with the lack of boundaries for streams of explicit
reconciliation events. The synchronous response defined for the currently
published operation status reconciliation call isn't consistent with the
rest of the v1 scheduler API, which generated a bit of extra work (for me)
in the low-level mesos v1 http client lib. Consistency should be a primary
goal when extending existing API sets.

(e) There are probably other ways to solve the problem of a "lack of
boundaries within the event stream" for explicit reconciliation requests.
If this is a problem that other framework devs need solved, then let's
address it as a separate issue - and aim to resolve it in a consistent way
for both task and operation status event streams.

(f) It sounds like option (III) would let Mesos send back smarter operation
statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN). Anything to
limit the number of scenarios where UNKNOWN is returned to frameworks
sounds good to me.

-James



On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

> Hi,
>
> have we reached a conclusion here?
>
> From the Mesos side of things I would be strongly in favor of proposal
> (III). This is not only consistent with what we do with task status
> updates, but also would allow us to provide improved operation status
> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN`) to
> better distinguish non-terminal from terminal operation states. To
> accomplish that we wouldn’t need to introduce extra information leakage
> (e.g., explicitly keeping master up to date on local resource provider
> state and associated internal consistency complications).
>
> This approach should also simplify framework development as a framework
> would only need to watch a single channel to see operation status updates
> (no need to reconcile different information sources). The benefits of
> better status updates and simpler implementation IMO outweigh any benefits
> of the current approach (disclaimer: I filed the slightly inflammatory
> MESOS-9448).
>
> What is keeping us from moving forward with (III) at this point?
>
>
> Cheers,
>
> Benjamin
>
> > On Jan 3, 2019, at 11:30 PM, Benno Evers  wrote:
> >
> > Hi Chun-Hung,
> >
> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per node,
> then the master needs to maintain 20k entries for LRPs.
> >
> > How big would the required additional storage be in this scenario? Even
> if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
> for such a big cluster.
> >
> > In general, it seems hard to discuss the trade-offs between your
> proposals without looking at the users of that API - do you know if there
> are any frameworks out there that already use
> >  operation reconciliation, and if so what do they do based on the
> reconciliation response?
> >
> > As far as I know, we don't have any formal guarantees on which
> operation status changes the framework will receive without
> reconciliation. So putting on my framework-implementer hat it seems like
> I'd have no choice but to implement a continuously polling background loop
> anyway if I care about knowing the latest operation statuses. If this is
> indeed the case, having a synchronous `RECONCILE_OPERATIONS` would seem to
> have little additional benefit.
> >
> > Best regards,
> > Benno
> >
> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao 
> wrote:
> > Hi folks,
> >
> > Recently I've been discussing the problems of the current design of the
> > experimental
> > `RECONCILE_OPERATIONS` scheduler API with a couple of people. The discussion
> > was started
> > from MESOS-9318 <https://is

Re: ACK status update before or after handler?

2018-12-20 Thread James DeFelice
ACK'ing can be performed completely asynchronously with respect to event handling;
there's no need to perform an ACK on the same goroutine/thread as the event
handling. Also, the example framework in mesos-go doesn't implement robust
recovery in the face of failure.

Generally speaking, you only want to ACK an update once your framework has
durably recorded the fact that the update has been received from Mesos; for
example, by updating some persistent state store. There's some additional
discussion about ACKs in the Mesos docs:
http://mesos.apache.org/documentation/latest/high-availability-framework-guide/#mesos-architecture
.
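
To make the ordering concrete, here is a minimal sketch of the
ACK-after-persist pattern. The types and method names are hypothetical
stand-ins for a framework's own plumbing, not the mesos-go API:

package framework

// TaskStatus, Store, and Client are hypothetical stand-ins; only the
// fields needed for this sketch are modeled.
type TaskStatus struct {
    TaskID  string
    AgentID string
    UUID    []byte
    State   string
}

type Store interface {
    // Save must be durable (e.g. an fsync'd log or a database write).
    Save(TaskStatus) error
}

type Client interface {
    Acknowledge(agentID, taskID string, uuid []byte) error
}

// HandleUpdate persists the update first and ACKs second. If the
// process dies between the two steps, Mesos redelivers the update, so
// the worst case is a duplicate write, never a lost update.
func HandleUpdate(store Store, client Client, s TaskStatus) error {
    if err := store.Save(s); err != nil {
        return err // no ACK: Mesos will resend the update
    }
    return client.Acknowledge(s.AgentID, s.TaskID, s.UUID)
}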


On Wed, Dec 19, 2018 at 3:16 PM Michał Łowicki  wrote:

> Hey!
>
> By looking at mesos-go <https://github.com/mesos/mesos-go> I've found
> that in the example the ACK is done before the handler is called (here
> <https://github.com/mesos/mesos-go/blob/master/api/v1/cmd/msh/msh.go#L185>).
> I couldn't find a similar way to do it after the handler. I guess it depends on the
> use case, but what usually works better (if either)? In my understanding, if the ACK
> comes before the handler, it may end up in a situation where handlers for e.g.
> RUNNING and FAILED states interleave:
>
> RUNNING -> ACK -> handler for RUNNING starts -> FINISHED -> ACK -> handler
> for FINISHED starts -> handler for FINISHED ends -> handler for RUNNING
> ends.
>
> When the ACK is always after the handler, the handler for the RUNNING state will end
> before the handler for FINISHED even starts. Are there any problems with such an approach?
>
> --
> BR,
> Michał Łowicki
>


-- 
James DeFelice
585.241.9488 (voice)
650.649.6071 (fax)


Re: How is running 1.7.0 in production?

2018-11-13 Thread James Peach


> On Nov 13, 2018, at 5:45 PM, Stuart Elston  wrote:
> 
> Hi everyone,
> 
> We are contemplating an upgrade to Mesos 1.7.0 but are generally a little 
> wary of running .0 releases.  Has anyone encountered any showstoppers while 
> running 1.7.0?  We'd be curious to hear your experiences!

I’ve been running something slightly pre- the 1.7.0 release tag in prod for a 
long time and it’s fine. I’m currently rolling out a post- 1.7.0 snapshot and 
that’s going well so far.

J

[ANNOUNCE] mesos_exporter 1.1.1 released

2018-10-25 Thread James Peach
Hi all,

Just a quick note to say that mesos_exporter 1.1.1 has been released. This is a 
bug fix release that fixes a regression I introduced in v1.1.0. Source code and 
binaries are available on GitHub.

https://github.com/mesos/mesos_exporter/releases/tag/v1.1.1

Thanks to Chase Sillevis who contributed the fix for this release.

cheers,
James

Re: Propose to run debug container as the same user of its parent container by default

2018-10-25 Thread James Peach


> On Oct 23, 2018, at 7:47 PM, Qian Zhang  wrote:
> 
> Hi all,
> 
> Currently when launching a debug container (e.g., via `dcos task exec` or 
> command health check) to debug a task, by default Mesos agent will use the 
> executor's user as the debug container's user. There are actually 2 cases:
> 1. Command task: Since the command executor's user is same with command 
> task's user, so the debug container will be launched as the same user of the 
> command task.
> 2. The task in a task group: The default executor's user is same with the 
> framework user, so in this case the debug container will be launched as the 
> same user of the framework rather than the task.
> 
> Basically I think the behavior of case 1 is correct. For case 2, we may run 
> into a situation where the task is run as one user (e.g., root), but the debug 
> container used to debug that task is run as another user (e.g., a normal 
> user, supposing the framework is run as a normal user); this may not be what 
> the user expects.
> 
> So I created MESOS-9332  
> and propose to run the debug container as the same user as its parent container 
> (i.e., the task to be debugged) by default. Please let me know if you have 
> any comments, thanks!

This sounds like a sensible default to me. I can imagine for debug use cases 
you might want to run the debug container as root or give it elevated 
capabilities, but that should not be the default.

J

Re: [VOTE] Release Apache Mesos 1.7.0 (rc3)

2018-09-14 Thread James Peach
+1 (binding)

make check on Fedora 28

> On Sep 11, 2018, at 11:09 AM, Gastón Kleiman  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
> 
> 
> 1.7.0 includes the following:
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF sorter
> 
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc3
>  
> 
> 
> 
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz 
> 
> 
> The tag to be voted on is 1.7.0-rc3:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc3 
> 
> 
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.sha512
>  
> 
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.asc 
> 
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS 
> 
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1234 
> 
> 
> Please vote on releasing this package as Apache Mesos 1.7.0!
> 
> The vote is open until Fri Sep 14 11:06:30 PDT 2018 and passes if a majority 
> of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> 
> Chun-Hung & Gastón



Re: Libevent bundling ahead.

2018-09-12 Thread James Peach



> On Sep 11, 2018, at 6:14 PM, Till Toenshoff  wrote:
> 
> Hey All,
> 
> We are considering bundling/vendoring libevent 2.0.22 with upcoming releases 
> of Mesos.
> 
> Let me explain the motivation and then go into some details.
> 
> Due to https://issues.apache.org/jira/browse/MESOS-7076, SSL builds of Mesos 
> stopped functioning on distributions that offer libevent 2.1.8 by default. 
> Specifically the failure was observed on Ubuntu 17/18 as well as on macOS. It 
> has also just come to my attention that Fedora 18 shares the same fate.

F28

> So the problem is less likely OS specific but more likely libevent + SSL + 
> libprocess specific.
> Instead of getting stuck in the rabbit hole of debugging right away, I 
> decided that bundling a known good version of libevent was the most reliable 
> way to prevent sad faces when building Mesos with SSL. Instead, we can be 
> sure that SSL builds of Mesos function properly across all supported 
> platforms, out of the box.
> 
> Details on the bundling;
> We will include libevent 2.0.22 and we also include a patch that makes that 
> version build against both openssl 1.0.x as well as 1.1.x. For unbundled 
> builds (--with-libevent) I have some additional checks foreseen that try to 
> prevent a build of a known bad variant of libevent + SSL + Mesos.
> 
> The bundling and those checks are a workaround, not a solution. I still am 
> pursuing debugging of the underlying cause. However, way too much time has 
> passed already without a proper solution, hence this suggestion of a quick 
> fix, bundling workaround.
> 
> Let me know your thoughts!

I think this is OK as long as we have a reasonable expectation that we can 
unbundle soon-ish.

J

Re: make check failed, but mesos-tests.sh --gtest_filter="SVNTest.DiffPatch" tests passed

2018-09-04 Thread James Peach
This might be caused by inconsistent linking in Homebrew. Try forcing Homebrew 
to build svn from source, something like this: brew install --force 
--build-from-source subversion


> On Sep 4, 2018, at 2:29 AM, Chang Shawn  wrote:
> 
> After running 'make' successfully on my macOS 10.13.6, I ran 'make check', but it 
> fails on the test case "SVNTest.DiffPatch". The error output is:
> 
> [--] 2 tests from SVNTest
> 
> [ RUN  ] SVNTest.DiffPatch
> 
> *** Aborted at 1536051660 (unix time) try "date -d @1536051660" if you are 
> using GNU date ***
> 
> PC: @0x1094239d6 apr_pool_create_ex
> 
> *** SIGSEGV (@0x30) received by PID 84174 (TID 0x7fff8a2b6380) stack trace: 
> ***
> 
> @ 0x7fff51ab0f5a _sigtramp
> 
> @0x0 (unknown)
> 
> @0x10922380e svn_pool_create_ex
> 
> @0x107e13f4e svn::diff()
> 
> @0x107e133eb SVNTest_DiffPatch_Test::TestBody()
> 
> @0x107fbbebe 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> 
> @0x107f5c01b 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> 
> @0x107f5bf46 testing::Test::Run()
> 
> @0x107f5dd5d testing::TestInfo::Run()
> 
> @0x107f5f38c testing::TestCase::Run()
> 
> @0x107f6fbac testing::internal::UnitTestImpl::RunAllTests()
> 
> @0x107fbf14e 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> 
> @0x107f6f5db 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> 
> @0x107f6f49c testing::UnitTest::Run()
> 
> @0x107b51ab1 RUN_ALL_TESTS()
> 
> @0x107b51825 main
> 
> @ 0x7fff517a2015 start
> 
> make[6]: *** [check-local] Segmentation fault: 11
> 
> make[5]: *** [check-am] Error 2
> 
> make[4]: *** [check-recursive] Error 1
> 
> make[3]: *** [check] Error 2
> 
> make[2]: *** [check-recursive] Error 1
> 
> make[1]: *** [check] Error 2
> 
> make: *** [check-recursive] Error 1
> 
> So I ran ./bin/mesos-tests.sh --gtest_filter="SVNTest.DiffPatch" to try to 
> get more information, but it seems that the tests passed:

The SVN tests are part of stout (but are run during make check):

 ./3rdparty/stout/stout-tests --gtest_list_tests 
--gtest_filter="SVNTest.DiffPatch"
SVNTest.
  DiffPatch

J

Re: [VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-29 Thread James Peach
+1 (binding)

Built and tested on Fedora 28 (clang).

> On Aug 24, 2018, at 4:42 PM, Chun-Hung Hsiao  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
> 
> 
> 1.7.0 includes the following:
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF sorter
> 
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc2
>  
> 
> 
> 
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz 
> 
> 
> The tag to be voted on is 1.7.0-rc2:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc2 
> 
> 
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.sha512
>  
> 
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.asc 
> 
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS 
> 
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1233 
> 
> 
> Please vote on releasing this package as Apache Mesos 1.7.0!
> 
> The vote is open until Mon Aug 27 16:37:35 PDT 2018 and passes if a majority 
> of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> Chun-Hung & Gaston



[ANNOUNCE] mesos_exporter 1.1.0 released

2018-08-23 Thread James Peach
Hi all,

I'm pleased to announce that mesos_exporter 1.1.0 has been released. This is a 
minor release with a collection of bug fixes and minor features.

https://github.com/mesos/mesos_exporter/releases/tag/v1.1.0

Many thanks to Mesosphere, who kindly contributed the code to the Mesos 
community, and to the following contributors: Alan Bover, Eric Lubow, Hector 
Fernandez, Jack Thomasson, James Peach, Jonathan Sokolowski, Philip Norman, 
Stephan Erb, Trevor Wood and Vinod Kone.

cheers,
James

Re: Volume ownership and permission

2018-08-16 Thread James Peach



> On Aug 15, 2018, at 6:22 PM, Qian Zhang  wrote:
> 
> Hi Folks,
> 
> We found some issues for the solutions of this project and propose a better
> one, see here
> <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.tjuy5xk67tuu>
> for details. Please let me know if you have any comments, thanks!

Some general comments.

I assume that this scheme will only be supported on Linux, due to the 
dependencies on the Linux ACLs and supplementary group behaviour?  

Rewriting ACLs on volumes at each container launch sounds hugely expensive. 
It's an IOPS-bound process and there is an effectively unbounded number of files 
in the volume. Would this serialize container cleanup?

It seems like ACL evaluation will mean that this scheme will only mostly work. 
For example, if the container process UID matches a user ACE, then access could 
be denied independently of the volume policy.

Will the VolumeAclManager apply a default ACL on the root of the volume? Does 
this imply that when it updates the ACEs for the container GID, it also needs 
to update the default ACLs on all directories?
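
For concreteness, the per-launch ACL rewrite under discussion would look 
roughly like this (a sketch; the volume path and container GID are 
illustrative):

  # Grant the container's supplementary group access to everything in the
  # volume, and add a default ACL so newly created entries inherit it.
  setfacl -R -m g:9000:rwX /mnt/volume
  setfacl -R -d -m g:9000:rwX /mnt/volume
  # Cleanup walks the whole tree again to drop the entries.
  setfacl -R -x g:9000 /mnt/volume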

> 
> 
> Regards,
> Qian Zhang
> 
> On Sat, Apr 28, 2018 at 7:57 AM, Qian Zhang  wrote:
> 
>>> The framework launched tasks in a group with different users? Sounds
>> like they dug their own hole :)
>> 
>> So you mean we should actually put a best practice or limitation in the doc:
>> when launching a task group with multiple tasks that share a SANDBOX volume
>> of PARENT type, all the tasks should be run as the same user, and that
>> user must be the same as the user that launched the executor? Otherwise the
>> tasks will not be able to write to the volume.
>> 
>>> I'd argue that the "rw" on the sandbox path is analogous to the "rw"
>> mount option. That is, it is mounted writeable, but says nothing about
>> which credentials can write to it.
>> 
>> Can you please elaborate a bit on this? What would you suggest for the
>> "rw" volume mode?
>> 
>> 
>> Regards,
>> Qian Zhang
>> 
>> On Fri, Apr 27, 2018 at 12:07 PM, James Peach  wrote:
>> 
>>> 
>>> 
>>>> On Apr 26, 2018, at 7:25 PM, Qian Zhang  wrote:
>>>> 
>>>> Hi James,
>>>> 
>>>> Thanks for your comment!
>>>> 
>>>> I think you are talking about the SANDBOX_PATH volume ownership issue
>>>> mentioned in the design doc
>>>> <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE
>>> -v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p>,
>>>> IIUC, you prefer to leaving it to framework, i.e., framework itself
>>> ought
>>>> to be able to handle such issue. But I am curious how framework can
>>> handle
>>>> it in such situation. If the framework launches a task group with
>>> different
>>>> users and with a SANDBOX_PATH volume of PARENT type, the tasks in the
>>> group
>>>> will definitely fail to write to the volume due to the ownership issue
>>>> though the volume's mode is set to "rw". So in this case, how should
>>>> framework handle it?
>>> 
>>> The framework launched tasks in a group with different users? Sounds like
>>> they dug their own hole :)
>>> 
>>> I'd argue that the "rw" on the sandbox path is analogous to the "rw"
>>> mount option. That is, it is mounted writeable, but says nothing about
>>> which credentials can write to it.
>>> 
>>>> And if we want to document it, what is our recommended
>>>> solution in the doc?
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Qian Zhang
>>>> 
>>>> On Fri, Apr 27, 2018 at 1:16 AM, James Peach  wrote:
>>>> 
>>>>> I commented on the doc, but at least some of the issues raised there I
>>>>> would not regard as issues. Rather, they are about setting expectations
>>>>> correctly and ensuring that we are documenting (and maybe enforcing)
>>>>> sensible behavior.
>>>>> 
>>>>> I'm not that keen on Mesos automatically "fixing" filesystem
>>> permissions
>>>>> and we should proceed down that path with caution, especially in the
>>> ACLs
>>>>> case.
>>>>> 
>>>>>> On Apr 10, 2018, at 3:15 AM, Qian Zhang  wrote:
>>>>>> 
>>>>>> Hi Folks,
>>>>>> 
>>>>>> I am working on MESOS-8767 to improve Mesos volume support regarding
>>>>> volume ownership and permission, here is the design doc. Please feel
>>> free
>>>>> to let me know if you have any comments/feedbacks, you can reply this
>>> mail
>>>>> or comment on the design doc directly. Thanks!
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Qian Zhang
>>>>> 
>>>>> 
>>> 
>>> 
>> 



Re: [VOTE] Move the project repos to gitbox

2018-07-17 Thread James Peach



> On Jul 17, 2018, at 7:58 AM, Vinod Kone  wrote:
> 
> Hi,
> 
> As discussed in another thread and in the committers sync, there seem to be 
> heavy interest in moving our project repos ("mesos", "mesos-site") from the 
> "git-wip" git server to the new "gitbox" server to better avail GitHub 
> integrations.
> 
> Please vote +1, 0, -1 regarding the move to gitbox. The vote will close in 3 
> business days.


+1

implicit mesos-local support in scheduler drivers

2018-07-03 Thread James Peach
Hi all,

I recently found that the Mesos scheduler drivers will implicitly spin up a 
`mesos-local` cluster for testing if your scheduler uses the scheduler 
drivers, specifies “local” as the master, and exports “MESOS_*” environment 
variables to configure the master. Do any scheduler authors use this? If so, 
can you describe the workflow?
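
For anyone unfamiliar with the feature, an invocation would look roughly like 
this (the scheduler binary name and the exported variable are illustrative):

  # "local" makes the scheduler driver spin up an in-process cluster.
  MESOS_WORK_DIR=/tmp/mesos-local ./my-scheduler --master=local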

thanks,
James

Re: narrowing task sandbox permissions

2018-06-15 Thread James Peach



> On Jun 15, 2018, at 11:06 AM, Zhitao Li  wrote:
> 
> Sorry for getting back to this really late, but we got bit by this behavior
> change in our environment.
> 
> The broken scenario we had:
> 
>   1. We are using Aurora to launch docker containerizer based tasks on
>   Mesos;
>   2. Most of our docker containers had some legacy behavior: *the
>   execution entered as "root" in the entry point script,* setup a couple
>   of symlinks and other preparation work, then *"de-escalate" into a non
>   privileged user (i.e, "user")*;
>  1. This was added so that the entry point script has enough
>  permission to reconfigure certain side car processes (i.e, nginx) and
>  filesystem paths;
>   3. unfortunately, the "user" user will lose access to the sandbox after
>   this change.
> 
> 
> While I'd acknowledge that above behavior is legacy and a piece of major
> tech debt, cleaning it up for the thousands of applications on our platform
> was never easy. Given that our org has other useful features available in
> 1.6, I would propose a couple of options:
> 
>   1. making the sandbox permission bits configurable
>  1. Certain framework knows that their tasks do not leave sensitive
>  data on sandbox so we could provide this flexibility (it's very useful in
>  practice for migration to a container based system);
>  2. Alternatively, making this possible to reconfigure on agent flags:
>  This could be more secure and easier to manage, but lacks flexibility of
>  allowing different frameworks to do different things.
>   2. Until the customization is in place, consider a revert of the
>   permission bit change so we preserve the original behavior.

That's a pretty unfortunate outcome. Can you change the permissions in your 
script, or carry a Mesos patch until the legacy behavior can be addressed?

J

Re: Deprecating the Python bindings

2018-06-06 Thread James Peach


> On May 9, 2018, at 11:51 AM, Andrew Schwartzmeyer  
> wrote:
> 
> Hi all,
> 
> There are two parallel efforts underway that would both benefit from 
> officially deprecating (and then removing) the Python bindings. The first 
> effort is the move to the CMake system: adding support to generate the Python 
> bindings was investigated but paused (see MESOS-8118), and the second effort 
> is the move to Python 3: producing Python 3 compatible bindings is under 
> investigation but not in progress (see MESOS-7163).
> 
> Benjamin Bannier, Joseph Wu, and I have all at some point just wondered how 
> the community would fare if the Python bindings were officially deprecated 
> and removed. So please, if this would negatively impact you or your project, 
> let me know in this thread.

Another approach could be to move the bindings from the `mesos` git repo to a 
separate repo (either the ASF or in the `mesos` GitHub org). This could 
decouple it from the main Mesos build infrastructure and create a project for a 
Python community to coalesce around. I think there's value in nominating an 
official Python binding, but maybe we don't have to carry that in the same git 
repo and build system.

J

Re: Volume ownership and permission

2018-04-26 Thread James Peach


> On Apr 26, 2018, at 7:25 PM, Qian Zhang <zhq527...@gmail.com> wrote:
> 
> Hi James,
> 
> Thanks for your comment!
> 
> I think you are talking about the SANDBOX_PATH volume ownership issue
> mentioned in the design doc
> <https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p>,
> IIUC, you prefer leaving it to the framework, i.e., the framework itself ought
> to be able to handle such an issue. But I am curious how the framework can handle
> it in such a situation. If the framework launches a task group with different
> users and with a SANDBOX_PATH volume of PARENT type, the tasks in the group
> will definitely fail to write to the volume due to the ownership issue, even
> though the volume's mode is set to "rw". So in this case, how should
> framework handle it?

The framework launched tasks in a group with different users? Sounds like they 
dug their own hole :)

I'd argue that the "rw" on the sandbox path is analogous to the "rw" mount 
option. That is, it is mounted writeable, but says nothing about which 
credentials can write to it.

> And if we want to document it, what is our recommended
> solution in the doc?
> 
> 
> 
> Regards,
> Qian Zhang
> 
> On Fri, Apr 27, 2018 at 1:16 AM, James Peach <jpe...@apache.org> wrote:
> 
>> I commented on the doc, but at least some of the issues raised there I
>> would not regard as issues. Rather, they are about setting expectations
>> correctly and ensuring that we are documenting (and maybe enforcing)
>> sensible behavior.
>> 
>> I'm not that keen on Mesos automatically "fixing" filesystem permissions
>> and we should proceed down that path with caution, especially in the ACLs
>> case.
>> 
>>> On Apr 10, 2018, at 3:15 AM, Qian Zhang <zhq527...@gmail.com> wrote:
>>> 
>>> Hi Folks,
>>> 
>>> I am working on MESOS-8767 to improve Mesos volume support regarding
>> volume ownership and permission, here is the design doc. Please feel free
>> to let me know if you have any comments/feedbacks, you can reply this mail
>> or comment on the design doc directly. Thanks!
>>> 
>>> 
>>> Regards,
>>> Qian Zhang
>> 
>> 



Re: Volume ownership and permission

2018-04-26 Thread James Peach
I commented on the doc, but at least some of the issues raised there I would 
not regard as issues. Rather, they are about setting expectations correctly and 
ensuring that we are documenting (and maybe enforcing) sensible behavior. 

I'm not that keen on Mesos automatically "fixing" filesystem permissions and we 
should proceed down that path with caution, especially in the ACLs case.

> On Apr 10, 2018, at 3:15 AM, Qian Zhang  wrote:
> 
> Hi Folks,
> 
> I am working on MESOS-8767 to improve Mesos volume support regarding volume 
> ownership and permission, here is the design doc. Please feel free to let me 
> know if you have any comments/feedbacks, you can reply this mail or comment 
> on the design doc directly. Thanks!
> 
> 
> Regards,
> Qian Zhang



Re: Update the *Minimum Linux Kernel version* supported on Mesos

2018-04-05 Thread James Peach


> On Apr 5, 2018, at 5:00 AM, Andrei Budnik  wrote:
> 
> Hi All,
> 
> We would like to update minimum supported Linux kernel from 2.6.23 to
> 2.6.28.
> Linux kernel supports cgroups v1 starting from 2.6.24, but `freezer` cgroup
> functionality was merged into 2.6.28, which supports nested containers.

User namespaces require >= 3.12 (November 2013). Can we make that the minimum?

J

Re: Support deadline for tasks

2018-03-23 Thread James Peach


> On Mar 23, 2018, at 9:57 AM, Renan DelValle  wrote:
> 
> Hi Zhitao,
> 
> Since this is something that could potentially be handled by the executor 
> and/or framework, I was wondering if you could speak to the advantages of 
> making this a TaskInfo primitive vs having the executor (or even the 
> framework) handle it.

There's some discussion around this on 
https://issues.apache.org/jira/browse/MESOS-8725.

My take is that delegating too much to the scheduler makes schedulers harder to 
write and exacerbates the complexity of the system. If 4 different schedulers 
implement this feature, operators are likely to need to understand 4 different 
ways of doing the same thing, which would be unfortunate. 

J

Re: Support deadline for tasks

2018-03-22 Thread James Peach


> On Mar 22, 2018, at 10:06 AM, Zhitao Li  wrote:
> 
> In our environment, we run a lot of batch jobs, some of which have tight 
> timelines. If any task in the job runs longer than x hours, it does not make 
> sense to run it anymore. 
>  
> For instance, a team would submit a job which builds a weekly index and 
> repeats every Monday. If the job does not finish before next Monday for 
> whatever reason, there is no point to keep any task running.
>  
> We believe that implementing deadline tracking distributed across our cluster 
> makes more sense as it makes the system more scalable and also makes our 
> centralized state machine simpler.
>  
> One idea I have right now is to add an optional TimeInfo deadline field to 
> the TaskInfo message, and all default executors in Mesos can simply terminate the 
> task and send a proper StatusUpdate.
> 
> I summarized above idea in MESOS-8725.
> 
> Please let me know what you think. Thanks! 

This sounds both useful and simple to implement. I’m happy to shepherd if you’d 
like.

J
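
As a sketch of the proposal (the field name and number below are hypothetical; 
see MESOS-8725 for the actual design):

  message TaskInfo {
    ...
    // Hypothetical: if set, the default executors terminate the task
    // and send a terminal status update once this deadline passes.
    optional TimeInfo deadline = 33;
  }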

Re: Build Failure

2018-03-19 Thread James Peach


> On Mar 19, 2018, at 4:38 PM, Shiv Deepak  wrote:
> 
> Thanks. I installed unzip. That worked.

FWIW the test suite was fixed for 1.6 in 
0da7b6cc37786df94465ae98948fd7be669a843e.

> 
> On Mon, Mar 19, 2018 at 3:48 PM, Tomek Janiszewski  wrote:
> Do you have unzip installed? Can you try unzipping a file like it's done in the 
> test? 
> 
> 
> Mon., 19.03.2018, 22:53, Shiv Deepak  wrote:
> Hello,
> 
> I am trying to build Mesos 1.5.0 from source on Ubuntu 16.04.
> 
> I tried on Docker, VM, and EC2. Three test cases are failing no matter what.
> 
> Here is the list.
> 
> [  PASSED  ] 1904 tests.
> [  FAILED  ] 3 tests, listed below:
> [  FAILED  ] FetcherTest.Unzip_ExtractFile
> [  FAILED  ] FetcherTest.Unzip_ExtractInvalidFile
> [  FAILED  ] FetcherTest.Unzip_ExtractFileWithDuplicatedEntries
> 
> Here is the test output:
> 
> [ RUN  ] FetcherTest.Unzip_ExtractFile
> ../../src/tests/fetcher_tests.cpp:870: Failure
> (fetch).failure(): Failed to fetch all URIs for container 
> '709de28f-5f71-439d-a032-072df865090f': exited with status 1
> [  FAILED  ] FetcherTest.Unzip_ExtractFile (297 ms)
> [ RUN  ] FetcherTest.Unzip_ExtractInvalidFile
> ../../src/tests/fetcher_tests.cpp:936: Failure
> Value of: os::exists(extractedFile)
>   Actual: false
> Expected: true
> [  FAILED  ] FetcherTest.Unzip_ExtractInvalidFile (201 ms)
> [ RUN  ] FetcherTest.Unzip_ExtractFileWithDuplicatedEntries
> ../../src/tests/fetcher_tests.cpp:997: Failure
> (fetch).failure(): Failed to fetch all URIs for container 
> 'dd749015-3d16-4926-b7f3-e1c96211a461': exited with status 1
> [  FAILED  ] FetcherTest.Unzip_ExtractFileWithDuplicatedEntries (201 ms)
> 
> Is this expected or do I need to fix something? Can someone please point me 
> in the right direction?
> 
> Thank you
> 
> -- 
> 
> Shiv Deepak▌
> Engineering Manager
> HackerRank
> 
> Blog / Twitter / Linkedin / Facebook
> 



Re: [VOTE] Release Apache Mesos 1.5.0 (rc2)

2018-02-07 Thread James Peach
+1 (binding)

Tested on Fedora 27

> On Feb 1, 2018, at 5:36 PM, Gilbert Song  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.5.0.
> 
> 1.5.0 includes the following:
> 
>  * Support Container Storage Interface (CSI).
>  * Agent reconfiguration policy.
>  * Auto GC docker images in Mesos Containerizer.
>  * Standalone containers.
>  * Support gRPC client.
>  * Non-leading VOTING replica catch-up.
> 
> 
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.0-rc2
> 
> 
> The candidate for Mesos 1.5.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc2/mesos-1.5.0.tar.gz
> 
> The tag to be voted on is 1.5.0-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.0-rc2
> 
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc2/mesos-1.5.0.tar.gz.md5
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc2/mesos-1.5.0.tar.gz.asc
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1222
> 
> Please vote on releasing this package as Apache Mesos 1.5.0!
> 
> The vote is open until Tue Feb  6 17:35:16 PST 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.5.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> Jie and Gilbert



Re: [VOTE] Release Apache Mesos 1.5.0 (rc1)

2018-01-24 Thread James Peach
+1

Verified on CentOS 6 and Fedora 27

> On Jan 22, 2018, at 7:15 PM, Gilbert Song  wrote:
> 
> Hi all,
> 
> Please vote on releasing the following candidate as Apache Mesos 1.5.0.
> 
> 1.5.0 includes the following:
> 
>   * Support Container Storage Interface (CSI).
>   * Agent reconfiguration policy.
>   * Auto GC docker images in Mesos Containerizer.
>   * Standalone containers.
>   * Support gRPC client.
>   * Non-leading VOTING replica catch-up.
> 
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.0-rc1
> 
> 
> The candidate for Mesos 1.5.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz
> 
> The tag to be voted on is 1.5.0-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.0-rc1
> 
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz.md5
> 
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.0-rc1/mesos-1.5.0.tar.gz.asc
> 
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
> 
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1221
> 
> Please vote on releasing this package as Apache Mesos 1.5.0!
> 
> The vote is open until Thu Jan 25 18:24:36 PST 2018 and passes if a majority 
> of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Mesos 1.5.0
> [ ] -1 Do not release this package because ...
> 
> Thanks,
> Jie and Gilbert



Re: Doc-a-thon - January 11th, 2018

2018-01-09 Thread James Peach
Just a reminder that the doc-a-thon is this Thursday :)

> On Nov 21, 2017, at 4:14 PM, Judith Malnick  wrote:
> 
> Hi all,
> 
> I'm excited to announce the next Apache Mesos doc-a-thon!
> 
> *Date:* January 11th, 2018
> 
> Location:
> 
> Mesosphere HQ
> 
> 88 Stevenson Street
> 
> San Francisco, CA
> 
> Schedule (Pacific time):
> 
> 3 - 3:30 PM: Discuss docs projects, split into groups
> 
> 3:30 - 6:30 PM: Work on docs
> 
> 6:30 - 7 PM: Present progress
> 
> 7 - 8 PM: Drinks and hangout!
> 
> 
> If you will be attending in person, please RSVP so we
> know how much food to get.
> If you plan on attending remotely, you can join via the Zoom link.
> Feel free to brainstorm project proposals on the planning doc.
> 
> 
> Let me know if you have any questions. I'm looking forward to seeing all of
> you and your amazing projects!
> 
> All the Best,
> Judith
> -- 
> Judith Malnick
> Community Manager
> 310-709-1517



Re: Container user '27' is not supported

2017-12-25 Thread James Peach


> On Dec 25, 2017, at 2:22 PM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
> 
> 
> Should this be done via the parameters? What key?
> 
> "parameters": [{ "key": "net", "value": "host" }]
> 
> 
> {
>  "id": "sflow/vizceral",
>  "cmd": null,
>  "cpus": 0.2,
>  "mem": 256,
>  "instances": 1,
>  "acceptedResourceRoles": ["*"],
>  "constraints": [["hostname", "CLUSTER", "m02.local"]],
>  "container": {
>"type": "MESOS",
>"docker": {
>  "image": "sflow/vizceral",
>  "credential": null,
>  "forcePullImage": false
>}
> 
>  }
> }

I guess this is a Marathon task spec? I’m not familiar with the Marathon API, 
but it looks to me like you would specify the “user” field in the application:

https://docs.mesosphere.com/1.9/deploying-services/marathon-api/#/apps/V2Apps3
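
A minimal sketch of the quoted app spec with that field added (assuming 
Marathon's top-level "user" maps through to TaskInfo.user):

  {
    "id": "sflow/vizceral",
    "user": "sflowrt",
    "container": {
      "type": "MESOS",
      "docker": { "image": "sflow/vizceral" }
    }
  }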

> 
> 
> Dec 25 23:15:40 m02 mesos-slave[18569]: W1225 23:15:40.251715 18595 
> runtime.cpp:111] Container user 'sflowrt' is not supported yet for 
> container 375b21ca-2d12-4a81-8429-897aac75eaa0
> 
> -Original Message-
> From: James Peach [mailto:jor...@gmail.com] 
> Sent: zondag 24 december 2017 18:01
> To: user
> Subject: Re: Container user '27' is not supported
> 
> 
> 
>> On Dec 24, 2017, at 5:20 AM, Marc Roos <m.r...@f1-outsourcing.eu> 
> wrote:
>> 
>> 
>> I am seeing this in the logs:
>> 
>> Container user '27' is not supported yet for container
>> d823196a-4ec3-41e3-a4c0-6680ba5cc99
>> 
>> I guess this means that the container requests to run under a specific 
> 
>> user id, and this is not yet available in mesos?
> 
> This means that the containerizer parsed the container user out of the 
> manifest, but we don’t support running the container as that user. You 
> should continue to use the TaskInfo message to specify which user the 
> container will run as.
> 
> J
> 



Re: Container user '27' is not supported

2017-12-24 Thread James Peach


> On Dec 24, 2017, at 5:20 AM, Marc Roos  wrote:
> 
> 
> I am seeing this in the logs:
> 
> Container user '27' is not supported yet for container 
> d823196a-4ec3-41e3-a4c0-6680ba5cc99
> 
> I guess this means that the container requests to run under a specific 
> user id, and this is not yet available in mesos?

This means that the containerizer parsed the container user out of the 
manifest, but we don’t support running the container as that user. You should 
continue to use the TaskInfo message to specify which user the container will 
run as.

J

narrowing task sandbox permissions

2017-12-14 Thread James Peach
Hi all,

In https://issues.apache.org/jira/browse/MESOS-8332, I'm proposing a change to 
narrow the permissions used for the task sandbox directory from 0755 to 0750, 
removing access for users outside the sandbox owner's group. Note that this 
change also makes failure to chown this directory a hard failure.

I expect this is a safe change for well-behaved configurations, but please let 
me know if you have any compatibility concerns.

thanks,
James

Re: Adding a new agent terminates existing executors?

2017-11-15 Thread James Peach

> On Nov 15, 2017, at 8:24 AM, Dan Leary  wrote:
> 
> Yes, as I said at the outset, the agents are on the same host, with different 
> ip's and hostname's and work_dir's.
> If having separate work_dirs is not sufficient to keep containers separated 
> by agent, what additionally is required?

You might also need to specify other separate agent directories, like 
--runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of 
mesos-agent --flags.

> 
> 
> On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone  wrote:
> How is agent2 able to see agent1's containers? Are they running on the same 
> box!? Are they somehow sharing the filesystem? If yes, that's not supported.
> 
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary  wrote:
> Sure, master log and agent logs are attached.
> 
> Synopsis:  In the master log, tasks t01 and t02 are running...
> 
> > I1114 17:08:15.972033  5443 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t01 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:19.142276  5448 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t02 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> 
> The operator starts up agent2 around 17:08:50ish. Executor1 and its tasks are 
> terminated:
> 
> > I1114 17:08:54.835841  5447 master.cpp:6964] Executor 'executor1' of 
> > framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1): terminated with signal Killed
> > I1114 17:08:54.835959  5447 master.cpp:9051] Removing executor 'executor1' 
> > with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on 
> > agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1)
> > I1114 17:08:54.837419  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837497  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896  5436 master.cpp:8928] Updating the state of task 
> > t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.839221  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493  5436 master.cpp:8928] Updating the state of task 
> > t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> 
> But agent2 doesn't register until later...
> 
> > I1114 17:08:55.588762  5442 master.cpp:5714] Received register agent 
> > message from slave(1)@127.1.1.2:5052 (agent2)
> 
> Meanwhile in the agent1 log, the termination of executor1 appears to be the 
> result of the destruction of its container...
> 
> > I1114 17:08:54.810638  5468 containerizer.cpp:2612] Container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732  5468 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761  5468 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> 
> Apparently because agent2 decided to "recover" the very same container...
> 
> > I1114 17:08:54.775907  6041 linux_launcher.cpp:373] 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634  6037 containerizer.cpp:966] Cleaning up orphan 
> > container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705  6037 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737  6037 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> > I1114 17:08:54.780740  6041 linux_launcher.cpp:505] Asked to destroy 
> > 

Re: 1.4.1 release

2017-11-03 Thread James Peach
I think MESOS-8169 is a candidate, but I won't be able to get to it until next 
week.


> On Nov 3, 2017, at 1:48 AM, Qian Zhang <zhq527...@gmail.com> wrote:
> 
> And I will backport MESOS-8051 to 1.2.x, 1.3.x and 1.4.x.
> 
> 
> Regards,
> Qian Zhang
> 
> On Fri, Nov 3, 2017 at 9:01 AM, Qian Zhang <zhq527...@gmail.com> wrote:
> We want to backport https://reviews.apache.org/r/62518/ to 1.2.x, 1.3.x and 
> 1.4.x, James will work on it.
> 
> 
> Regards,
> Qian Zhang
> 
> On Fri, Nov 3, 2017 at 12:11 AM, Kapil Arya <ka...@mesosphere.io> wrote:
> Please reply to this email if you have pending patches to be backported to 
> 1.4.x as we are aiming to cut a release candidate for 1.4.1 early next week.
> 
> Thanks,
> Anand and Kapil
> 
> 



Re: clearing the executor authentication token from the task environment

2017-11-02 Thread James Peach

> On Nov 1, 2017, at 2:28 PM, James Peach <jor...@gmail.com> wrote:
> 
> Hi all,
> 
> In https://issues.apache.org/jira/browse/MESOS-8140, I'm proposing that we 
> clear the MESOS_EXECUTOR_AUTHENTICATION_TOKEN environment variable 
> immediately after consuming it in the built-in executors. This protects it 
> from observation by other tasks in the same PID namespace, however I wanted 
> to verify that no-one currently has a use case that depends on this. 
> Currently, the token is inherited to the environment of tasks running under 
> the command executor (i.e. not to task group tasks).
> 
> Eventually we would add a formal API for tasks to access the executor token 
> in MESOS-8018.

OK, we will be landing this change for Mesos 1.5.

thanks,
James

clearing the executor authentication token from the task environment

2017-11-01 Thread James Peach
Hi all,

In https://issues.apache.org/jira/browse/MESOS-8140, I'm proposing that we 
clear the MESOS_EXECUTOR_AUTHENTICATION_TOKEN environment variable immediately 
after consuming it in the built-in executors. This protects it from observation 
by other tasks in the same PID namespace, however I wanted to verify that 
no-one currently has a use case that depends on this. Currently, the token is 
inherited to the environment of tasks running under the command executor (i.e. 
not to task group tasks).

Eventually we would add a formal API for tasks to access the executor token in 
MESOS-8018.

thanks,
James

Re: Adding the limited resource to TaskStatus messages

2017-10-10 Thread James Peach

> On Oct 9, 2017, at 7:15 PM, Wil Yegelwel <wyegel...@gmail.com> wrote:
> 
> Is it correct to say that the limited resource field is *only* meant to 
> provide machine-readable information about what resource limits were 
> exceeded?

Yes.

> If so, does it make sense to provide richer reporting fields for all failure 
> reasons? I imagine other failure reasons could benefit from being able to 
> report details of the failure that are machine readable.

Some other reasons already have their own structured information, eg. the 
TASK_UNREACHABLE state populates the `unreachable_time` field. I'm not planning 
to add structured information to any other failure reasons, but I'd support 
doing it if you have a specific suggestion.

> On Mon, Oct 9, 2017, 3:50 PM James Peach <jor...@gmail.com> wrote:
> 
> > On Oct 9, 2017, at 1:27 PM, Vinod Kone <vinodk...@apache.org> wrote:
> >
> >> In the case that a task is killed because it violated a resource
> >> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
> >> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
> >> this field may be populated with the resource that triggered the
> >> limitation. This is intended to give better information to schedulers about
> >> task resource failures, in the expectation that it will help them bubble
> >> useful information up to the user or a monitoring system.
> >>
> >
> > Can you elaborate what schedulers are expected to do with this information?
> > Looking for some concrete use cases if you can.
> 
> There's no concrete use case here; it's just a matter of propagating 
> information we know in a structured way.
> 
> If we assume that the scheduler knows about some sort of monitoring system or 
> has a UI, we can present this to the user or a system that can take action on 
> it. The status quo is that the raw message string is dumped to logs, and has 
> to be manually interpreted.
> 
> Additionally, this can pave the way to getting rid of 
> REASON_CONTAINER_LIMITATION_DISK and REASON_CONTAINER_LIMITATION_MEMORY. All 
> you really need is REASON_CONTAINER_LIMITATION plus the resource information.
> 
> J
> 



Re: Adding the limited resource to TaskStatus messages

2017-10-09 Thread James Peach

> On Oct 9, 2017, at 1:27 PM, Vinod Kone  wrote:
> 
>> In the case that a task is killed because it violated a resource
>> constraint (ie. the reason field is REASON_CONTAINER_LIMITATION,
>> REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY),
>> this field may be populated with the resource that triggered the
>> limitation. This is intended to give better information to schedulers about
>> task resource failures, in the expectation that it will help them bubble
>> useful information up to the user or a monitoring system.
>> 
> 
> Can you elaborate what schedulers are expected to do with this information?
> Looking for some concrete use cases if you can.

There's no concrete use case here; it's just a matter of propagating 
information we know in a structured way.

If we assume that the scheduler knows about some sort of monitoring system or 
has a UI, we can present this to the user or a system that can take action on 
it. The status quo is that the raw message string is dumped to logs, and has to 
be manually interpreted. 

Additionally, this can pave the way to getting rid of 
REASON_CONTAINER_LIMITATION_DISK and REASON_CONTAINER_LIMITATION_MEMORY. All 
you really need is REASON_CONTAINER_LIMITATION plus the resource information.

J



Adding the limited resource to TaskStatus messages

2017-10-09 Thread James Peach
Hi all,

In https://reviews.apache.org/r/62644/, I am proposing to add an optional 
Resources field to the TaskStatus message named `limited_resources`.

In the case that a task is killed because it violated a resource constraint 
(ie. the reason field is REASON_CONTAINER_LIMITATION, 
REASON_CONTAINER_LIMITATION_DISK or REASON_CONTAINER_LIMITATION_MEMORY), this 
field may be populated with the resource that triggered the limitation. This is 
intended to give better information to schedulers about task resource failures, 
in the expectation that it will help them bubble useful information up to the 
user or a monitoring system.

diff --git a/include/mesos/v1/mesos.proto b/include/mesos/v1/mesos.proto
index d742adbbf..559d09e37 100644
--- a/include/mesos/v1/mesos.proto
+++ b/include/mesos/v1/mesos.proto
@@ -2252,6 +2252,13 @@ message TaskStatus {
   // status updates for tasks running on agents that are unreachable
   // (e.g., partitioned away from the master).
   optional TimeInfo unreachable_time = 14;
+
+  // If the reason field indicates a container resource limitation,
+  // this field contains the resource whose limits were violated.
+  //
+  // NOTE: 'Resources' is used here because the resource may span
+  // multiple roles (e.g. `"mem(*):1;mem(role):2"`).
+  repeated Resource limited_resources = 16;
 }



cheers,
James




Re: RFC: Partition Awareness

2017-10-05 Thread James Peach

> On Jun 21, 2017, at 10:16 AM, Megha Sharma  wrote:
> 
> Thank you all for the feedback.
> To summarize, not killing tasks for non-Partition Aware frameworks will make 
> the schedulers see a higher volume of non terminal updates for tasks for 
> which they have already received a TASK_LOST but nothing new that they are 
> not seeing today. So, this shouldn’t be a breaking change for frameworks and 
> this will make the partition awareness logic simpler. I will update 
> MESOS-7215 with the details once the design is ready.

What happens for short-lived frameworks? That is, the lost task comes back, 
causing the master to track its framework as disconnected, but the framework is 
gone and will never return.

J

Re: Are there any supported systems without O_CLOEXEC?

2017-09-29 Thread James Peach

> On Sep 29, 2017, at 11:34 AM, Benjamin Mahler <bmah...@apache.org> wrote:
> 
> Is this altering the minimum Linux or OS X version we support?


I couldn't find a clear statement of what OS support we guarantee. OS X got 
O_CLOEXEC in 10.10. CentOS 6.9 has kernel 2.6.32, apparently Ubuntu 14.04 has 
3.19. Do we support anything older than that?

> 
> On Fri, Sep 29, 2017 at 9:15 AM, James Peach <jor...@gmail.com> wrote:
> 
>> 
>>> On Sep 27, 2017, at 5:03 PM, James Peach <jor...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> In MESOS-8027 and https://reviews.apache.org/r/62638/, I'm claiming
>> that, in practice, we do not have any supported platforms that don't
>> implement O_CLOEXEC to open. All current Linux, FreeBSD and Solaris
>> versions implement O_CLOEXEC. Does anyone know of a platform that doesn't
>> have O_CLOEXEC that we ought to work on?
>>> 
>>> https://www.freebsd.org/cgi/man.cgi?sektion=2=open
>>> http://man7.org/linux/man-pages/man2/open.2.html
>>> https://docs.oracle.com/cd/E23824_01/html/821-1463/open-2.html
>>> https://developer.apple.com/legacy/library/documentation/
>> Darwin/Reference/ManPages/man2/open.2.html
>> 
>> Bump! If you run Mesos on a platform that doesn't support O_CLOEXEC (eg.
>> Linux kernel <= 2.6.23), please let us know!
>> 
>> J



Re: Are there any supported systems without O_CLOEXEC?

2017-09-29 Thread James Peach

> On Sep 27, 2017, at 5:03 PM, James Peach <jor...@gmail.com> wrote:
> 
> Hi all,
> 
> In MESOS-8027 and https://reviews.apache.org/r/62638/, I'm claiming that, in 
> practice, we do not have any supported platforms that don't implement 
> O_CLOEXEC to open. All current Linux, FreeBSD and Solaris versions implement 
> O_CLOEXEC. Does anyone know of a platform that doesn't have O_CLOEXEC that we 
> ought to work on?
> 
> https://www.freebsd.org/cgi/man.cgi?sektion=2=open
> http://man7.org/linux/man-pages/man2/open.2.html
> https://docs.oracle.com/cd/E23824_01/html/821-1463/open-2.html
> https://developer.apple.com/legacy/library/documentation/Darwin/Reference/ManPages/man2/open.2.html

Bump! If you run Mesos on a platform that doesn't support O_CLOEXEC (eg. Linux 
kernel <= 2.6.23), please let us know!

J

Re: Collect feedbacks on TASK_FINISHED

2017-09-22 Thread James Peach

> On Sep 21, 2017, at 10:12 PM, Vinod Kone  wrote:
> 
> I think it makes sense for `TASK_KILLED` to be sent in response to a KILL
> call irrespective of the exit status. IIRC, that was the original intention.

Those are the semantics we implement and expect in our scheduler and executor. 
The only time we emit TASK_KILLED is in response to a scheduler kill, and a 
scheduler kill always ends in a TASK_KILLED.

The rationale for this is
1. We want to distinguish whether the task finished for its own reasons (ie. 
not due to a scheduler kill)
2. The scheduler told us to kill the task and we did, so it was TASK_KILLED 
(irrespective of any exit status)

> On Thu, Sep 21, 2017 at 8:20 PM, Qian Zhang  wrote:
> 
>> Hi Folks,
>> 
>> I'd like to collect the feedbacks on the task state TASK_FINISHED.
>> Currently the default and command executor will always send TASK_FINISHED
>> as long as the exit code of task is 0, this causes an issue: when scheduler
>> initiates a kill task, the executor will send SIGTERM to the task first,
>> and if the task handles SIGTERM gracefully and exit with 0, the executor
>> will send TASK_FINISHED for that task, so we will see the task state
>> transition: TASK_KILLING -> TASK_FINISHED.
>> 
>> This seems incorrect because we thought it should be TASK_KILLING ->
>> TASK_KILLED, that's why we filed a ticket MESOS-7975
>>  for it. However, I am
>> not very sure if it is really a bug, because I think it depends on how we
>> define the meaning of TASK_FINISHED, if it means the task is terminated
>> successfully on its own without external interference, then I think it does
>> not make sense for scheduler to receive a TASK_KILLING followed by a
>> TASK_FINISHED since there is indeed an external interference (killing task
>> is initiated by scheduler). However, if TASK_FINISHED means the task is
>> terminated successfully for whatever reason (no matter it is killed or
>> terminated on its own), then I think it is OK to receive a TASK_KILLING
>> followed by a TASK_FINISHED.
>> 
>> Please let us know your thoughts on this issue, thanks!
>> 
>> 
>> Regards,
>> Qian Zhang
>> 



Re: TASK_FAILED - Mesos Container Images

2017-09-06 Thread James Peach

> On Sep 6, 2017, at 4:41 AM, Thodoris Zois  wrote:
> 
> Hello, 
> 
> I am using the Mesos Containerizer with Docker Images. The problem is that 
> whenever a container exits my task gets TASK_FAILED because the container 
> exits with ‘1’.
> My docker file invokes a shell script via CMD /script.sh.
> 
> My protobuff can be found below:
> https://pastebin.com/1agjAFdm
> 
> 
> The agent log can be found here:
> https://pastebin.com/Q6qndVuU
> 
> 
> The stderr and stdout from the UI can be found here:
> stdout: https://pastebin.com/SkDDUDDJ
> 
> stderr: https://pastebin.com/t2TFgQ4G

/execute_blast.sh: line 4: /proc/sys/vm/drop_caches: Read-only file system


Does your task exit when this happens?




Re: Deprecating `--disable-zlib` in libprocess

2017-08-08 Thread James Peach

> On Aug 8, 2017, at 10:57 AM, Chun-Hung Hsiao  wrote:
> 
> Hi all,
> 
> In libprocess, we have an optional `--disable-zlib` flag, but it's
> currently not used
> for conditional compilation and we always use zlib in libprocess,
> and there's a requirement check in Mesos to make sure that zlib exists.
> Should this option be removed then?

Yes.

> Or is there anyone working on a system without zlib?
> 
> Thanks for your opinions!
> Chun-Hung



Re: Command Executor

2017-08-07 Thread James Peach

> On Aug 5, 2017, at 3:03 AM, Oeg Bizz  wrote:
> 
> I have a framework that relies on information sent by a custom Java Command 
> Executor; think of some sort of heartbeat.  I start getting heartbeats after I 
> send a task to that mesos-slave, but never before that.  That makes me assume 
> that the CommandExecutor is not started until a task is submitted to be 
> executed by that agent.  Is there a way to tell mesos-slave to start the 
> CommandExecutor as soon as it starts running?

Not AFAIK. Executors are always spawned in order to execute tasks. In your 
case, what is the heartbeat for, if there are no tasks on the agent?

J

Re: Mesos-docker-executor understanding

2017-07-21 Thread James Peach

> On Jul 19, 2017, at 10:05 AM, Thomas HUMMEL  wrote:
> 
> Hello,
> 
> I've read some books about Mesos, installed one multi-master cluster (for POC 
> purposes) with some frameworks (Marathon, Spark for instance) and watched some 
> talks.
> 
> Everything works and my understanding of Mesos is becoming clearer.
> However, I'm having a hard time fully understanding this section of the 
> documentation :
> 
>  http://mesos.apache.org/documentation/latest/containerizer-internals/
> 
> Note :
> 
> - my understanding is that Mesos is now heading towards a universal 
> containerizer, which is able to understand docker (among others) images thus 
> being able to "do some docker" without a docker daemon.


> 
> - Also, I don't think Mesos supports nested containers yet either.

http://mesos.apache.org/documentation/latest/nested-container-and-task-group/

> 
> But I'm not really sure about what is the mesos-docker-executor, as opposed 
> to mesos-executor.
> 
> - For instance, in the "A)" case of the slave running in a container, the doc 
> states :
> 
> "if the task does not include an executor i.e. it defines a command, the 
> default executor mesos-docker-executor is launched in a docker container to 
> execute the command via Docker CLI."
> 
> Why Docker CLI ? We are not in a shell context, are we ?
> Also, will the command be launched in another docker container, different 
> from the one running mesos-docker-executor ?
> 
> - In the B) case where the slave is not running in a container, the doc 
> states :
> 
> "If task does not include an executor i.e. it defines a command, a subprocess 
> is forked to execute the default executor mesos-docker-executor. 
> mesos-docker-executor then spawns a shell to execute the command via Docker 
> CLI."
> 
> Why is the mesos-docker-executor not run in a container ? Also, why does it 
> not use docker API directly ?
> 
> Can you help me figuring out how exactly mesos-docker-executor works and what 
> its specificality relative to the mesos-executor ?

From the scheduler's POV, I'd say you don't really need to know the difference. 
In the API you need to specify the container type to switch between using the 
Docker containerizer and the Mesos containerizer. For example, you can 
experiment by passing JSON to the --task option of mesos-execute:

{
  "name": "sleep",
  "agent_id": { "value": "any" },
  "task_id": {
    "value": "some-unique-uuid"
  },
  "resources": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": {
        "value": 0.4
      },
      "role": "*"
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": {
        "value": 32
      },
      "role": "*"
    }
  ],
  "command": {
    "value": "command-line that you want to run",
    "environment": {
      "variables": [
        { "name": "GLOG_v", "value": "2" }
      ]
    }
  },
  "container": {
    "type": "MESOS",
    "mesos": {
      "image": {
        "type": "DOCKER",
        "docker": {
          "name": "image/name"
        }
      }
    }
  }
}
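
For reference, if you save the JSON above to a file you can launch it with
something like the following (the master address and file path are
placeholders):

mesos-execute --master=<master-host>:5050 --task=file:///path/to/task.json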



Re: Format for attributes with no value

2017-07-14 Thread James Peach

> On Jul 13, 2017, at 1:41 PM, Jeff Kubina <jeff.kub...@gmail.com> wrote:
> 
> I want to know the format for an empty attribute in the list format for the 
> mesos-slave --attributes option. If I have an attribute, say key2, with no 
> value would it be "mesos-slave --attributes key1:value1;key2;key3:value3" or 
> "mesos-slave --attributes key1:value1;key2:;key3:value3 or does it not matter?

As I said before, I don't think there is a way to have an empty attribute 
value. 

$ sudo /opt/mesos/agent "--attributes=key1:value1;key2:;key3:value3"
...
F0714 06:16:49.332021 16167 attributes.cpp:145] Invalid attribute key:value 
pair 'key2:'

$ sudo /opt/mesos/agent "--attributes=key1:value1;key2;key3:value3"
...
F0714 06:17:58.916723 16297 attributes.cpp:145] Invalid attribute key:value 
pair 'key2'
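
Until that's fixed, a possible workaround (just a sketch) is to give the
attribute a sentinel text value that your own tooling treats as "empty"; here
'none' is only a placeholder convention, not anything Mesos interprets:

$ sudo /opt/mesos/agent "--attributes=key1:value1;key2:none;key3:value3"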

Please file a bug at https://issues.apache.org/jira/projects/MESOS

> 
> -- 
> Jeff Kubina
> 410-988-4436
> 
> 
> On Tue, Jul 11, 2017 at 5:20 AM, Oeg Bizz <oegb...@yahoo.com> wrote:
> James,
>   If you need an empty attribute as default for mesos, just create an empty 
> file with the '?' in front of it and save it in the /etc/mesos-<master|slave> directory.  For instance, if you want to enable authentication and 
> want to pass the --authenticate attribute then create an empty file called 
> /etc/mesos-master/?authenticate.
> 
> Not sure if that is what you meant with your question,
> 
> Oscar
> 
> 
> On Tuesday, July 11, 2017, 12:53:37 AM EDT, James Peach <jor...@gmail.com> 
> wrote:
> 
> 
> 
> > On Jul 7, 2017, at 4:46 PM, Jeff Kubina <jeff.kub...@gmail.com> wrote:
> > 
> > When setting an attribute with no value on a mesos-agent, is the colon 
> > needed, optional, or must it be omitted? It's not clear from the 
> > documentation. For example, which line or lines below are correct?
> > 
> > att1:val1;att2;att3:val3
> > 
> > att1:val1;att2:;att3:val3
> 
> 
> I don't see a way to express an empty attribute at all :(
> 



Re: Format for attributes with no value

2017-07-10 Thread James Peach

> On Jul 7, 2017, at 4:46 PM, Jeff Kubina  wrote:
> 
> When setting an attribute with no value on a mesos-agent, is the colon needed, 
> optional, or must it be omitted? It's not clear from the documentation. For 
> example, which line or lines below are correct?
> 
> att1:val1;att2;att3:val3
> 
> att1:val1;att2:;att3:val3

I don't see a way to express an empty attribute at all :(

Re: Dynamic reservations without a principal

2017-07-05 Thread James Peach

> On Jul 4, 2017, at 5:27 PM, Srikanth Viswanathan  wrote:
> 
> Hi folks,
> 
> I am trying to have the Chronos framework consume dynamic reservations in 
> Mesos. However, it appears that Chronos is unable to do this because it does 
> not pass the framework principal to Mesos when launching tasks (See 
> https://github.com/mesos/chronos/issues/843), which makes Mesos reject the 
> launch operation.
> 
> To get around this, I am considering changing my dynamic reservations to be 
> purely role-based instead of (role, principal)-based. Is this allowed/valid? 
> http://mesos.readthedocs.io/en/0.24.1/reservation/#dynamic-reservation-since-0230
>  says "resources are reserved for a role." Does this mean I can make a 
> dynamic reservation just for (role) instead of (role, principal)?

FYI the official documentation is at 
http://mesos.apache.org/documentation/latest/

Re: Framework change role

2017-07-05 Thread James Peach

> On Jul 5, 2017, at 2:39 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
> 
> Ok, probably you are right but what you don’t understand is that i am a 
> completely newbie and i see such systems for the first time. It’s about a 
> university project that i am working on in order to get my bachelor degree. I 
> don’t really know the proper way to express what i want to like you do. I 
> don’t know how to connect 2 frameworks or something, i don’t even know how to 
> make my own framework and compile it with the dependencies that Mesos needs. 
> I am just writing on TestFramework.java and TestExecutor.java and that’s 
> all. The reason that i want to run everything from the same framework 
> (instance) is because keep a TreeMap with some info that i don’t want to lose 
> if i terminate the Schedulers driver. So if i start up a new framework, 
> TreeMap is gone. Forget about the tasks, i can use the same TestExecutor.java 
> for every scheduler.

Note that you can start multiple Scheduler drivers within the same server. Each 
one can register with Mesos as a separate framework.

> What i want to achieve is to get offers from a specific agent with role “SB” 
> and run 10 tasks (i give also to my Framework role “SB”), then i store those 
> information (which are actually TaskInfo in Schedulers TreeMap). After that i 
> would like to change the role of my Framework to “*(default)”. Then i will 
> get offer from an agent that has the default role and i will still have the 
> info in my TreeMap because scheduler instance didn’t stop.

If you start multiple Scheduler drivers, they can all share a TreeMap because 
they are running in the same Java process.

> That’s all. My problem is that i don’t know how to change the role of the 
> Framework without losing that TreeMap, and also how to set it with version 
> 1.3.0.  
> 
> Hope that everybody understands now….
> Thank you, and i am really sorry for the spam
> 
> 
>> On 5 Jul 2017, at 12:24, James Peach <jor...@gmail.com> wrote:
>> 
>> 
>>> On Jul 5, 2017, at 12:54 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>> 
>>> Hi, 
>>> 
>>> No, i would like my framework to be offered resources from agent with role 
>>> (e.g: thz) and after running the specific tasks change its role to (*) in 
>>> order to get offers from different agents, but it will run the same tasks 
>>> because i am never terminating the scheduler driver (that’s what i want to).
>> 
>> As I suggested on Slack, I still think the most obvious way to implement 
>> this is to connect 2 frameworks, 1 for each role. Just co-ordinate 
>> internally to accept the offers you want in the right sequence. From your 
>> description, there's no requirement for this to be done in 1 framework.
>> 
>> I don't really follow what you mean by "run the same tasks". You can run new 
>> instances of the same task (from whatever framework you like); you can also 
>> send new tasks to an existing executor (from the same framework).
>> 
>>> Don’t try to find logic, it’s not for a company or something :)
>> 
>> I think that for people to help you, they need to be able to understand the 
>> logic of what you want to achieve and why :)
>> 
>>> 
>>> Thank you,
>>> Thodoris
>>> 
>>>> On 5 Jul 2017, at 05:36, Jay Guo <guojiannan1...@gmail.com> wrote:
>>>> 
>>>> Hi Thodoris,
>>>> 
>>>> If I understand correctly, you would like your framework to receive offers 
>>>> from both 'role' and '*', so resources reserved to 'role' on particular 
>>>> agent could be reliably supplied to the framework? Isn't it sufficient to 
>>>> start your framework with multiple roles, 'role' & '*'? You need to enable 
>>>> the capability.
>>>> 
>>>> - J
>>>> 
>>>> On Wed, Jul 5, 2017 at 7:28 AM, Thodoris Zois <z...@ics.forth.gr> wrote:
>>>> I have built a Framework in Java that is running certain tasks. I would 
>>>> like to run those tasks on a specific agent. I have set a role to the 
>>>> Framework and used flags upon starting of the agent. Till here everything 
>>>> is good. When framework has run tasks successfully i am not terminating 
>>>> it. I would like to change its role to default (*) and get offered 
>>>> resources from master that correspond to that role and it will run again 
>>>> the same amount of tasks (and the same tasks) because i never terminated 
>>>> (and i don't want to terminate its instance because i keep some mesos 
>>>> metrics to

Re: Framework change role

2017-07-05 Thread James Peach

> On Jul 5, 2017, at 12:54 AM, Thodoris Zois  wrote:
> 
> Hi, 
> 
> No, i would like my framework to be offered resources from agent with role 
> (e.g: thz) and after running the specific tasks change its role to (*) in 
> order to get offers from different agents, but it will run the same tasks 
> because i am never terminating the scheduler driver (that’s what i want to).

As I suggested on Slack, I still think the most obvious way to implement this 
is to connect 2 frameworks, 1 for each role. Just co-ordinate internally to 
accept the offers you want in the right sequence. From your description, 
there's no requirement for this to be done in 1 framework.

I don't really follow what you mean by "run the same tasks". You can run new 
instances of the same task (from whatever framework you like); you can also 
send new tasks to an existing executor (from the same framework).

> Don’t try to find logic, it’s not for a company or something :)

I think that for people to help you, they need to be able to understand the 
logic of what you want to achieve and why :)

> 
> Thank you,
> Thodoris
> 
>> On 5 Jul 2017, at 05:36, Jay Guo  wrote:
>> 
>> Hi Thodoris,
>> 
>> If I understand correctly, you would like your framework to receive offers 
>> from both 'role' and '*', so resources reserved to 'role' on particular 
>> agent could be reliably supplied to the framework? Isn't it sufficient to 
>> start your framework with multiple roles, 'role' & '*'? You need to enable 
>> the capability.
>> 
>> - J
>> 
>> On Wed, Jul 5, 2017 at 7:28 AM, Thodoris Zois  wrote:
>> I have built a Framework in Java that is running certain tasks. I would like 
>> to run those tasks on a specific agent. I have set a role to the Framework 
>> and used flags upon starting of the agent. Till here everything is good. 
>> When framework has run tasks successfully i am not terminating it. I would 
>> like to change its role to default (*) and get offered resources from master 
>> that correspond to that role and it will run again the same amount of tasks 
>> (and the same tasks) because i never terminated (and i don't want to 
>> terminate its instance because i keep some mesos metrics to a static 
>> TreeMap). That's all.. I just wanted somebody to explain me exactly how it 
>> works and what i have to do because everything i have tried today fails, and 
>> seems i can't find useful info on the Internet about this. 
>> 
>> Thank you!
>> 
>> On 4 Jul 2017, at 21:00, Michael Park  wrote:
>> 
>>> What is it that you need help with?
>>> 
>>> On Tue, Jul 4, 2017 at 11:12 AM Thodoris Zois  wrote:
>>> Hello list,
>>> 
>>> Is anybody available to help me with the new feature of 1.3.0 version, that 
>>> a framework can modify its role?
>>> 
>>> Thank you
>> 
> 



Re: ensuring a particular task is deployed to "all" Mesos Worker hosts

2017-07-01 Thread James Peach

> On Jul 1, 2017, at 11:14 AM, Erik Weathers  wrote:
> 
> Thanks for the info Kevin.  Seems there's no JIRAs nor design docs floating 
> about yet for "admin tasks" or "daemon sets".
> 
> Just FYI, this is the ticket in Storm for the problem I've been mentioning:
> 
> https://issues.apache.org/jira/browse/STORM-1342
> 
> I'll update it with the info you've provided below, so for now we'll rely on 
> manually deploying logviewers.

ISTM that this should be possible with a smart framework. If the framework 
keeps track of which agents it gets offers for, it could ensure that it 
launches a Storm logviewer task on the agent before launching any other Storm 
tasks. I expect that it might be a little tricky to get the containerization 
right so that the Storm tasks can rendezvous with the logviewer, but in 
principle it could be made to work?

> 
> Thanks!
> 
> - Erik
> 
> On Sat, Jul 1, 2017 at 10:09 AM Kevin Klues  wrote:
> What you are describing is a feature we call 'admin tasks' or 'daemon sets'. 
> 
> Unfortunately, there is no direct support for these yet, but we do have plans 
> in the (relatively) near future to start working on it.
> 
> One of our use cases is exactly what you describe with the logging service. 
> On DC/OS we currently run our logging service as a systemd unit outside of 
> mesos since we can't guarantee it gets launched everywhere (the same is true 
> for a bunch of other services as well, namely metrics).
> 
> We don't have an exact timeline for when we will build this support yet, but 
> we will certainly announce it once we start actively working on it.
> 
> 
> Erik Weathers  schrieb am Sa. 1. Juli 2017 um 09:45:
> That works for our particular use case, and is effectively what *we* do, but 
> renders storm a "strange bird" amongst mesos frameworks.  Is there no 
> trickery that could be played with mesos roles and/or reservations?
> 
> - Erik
> 
> On Sat, Jul 1, 2017 at 3:57 AM Dick Davies  wrote:
> If it _needs_ to be there always then I'd roll it out with whatever
> automation you use to deploy the mesos workers ; depending on
> the scale you're running at launching it as a task is likely to be less
> reliable due to outages etc.
> 
> ( I understand the 'maybe all hosts' constraint but if it's 'up to one per
> host', it sounds like a CM issue to me. )
> 
> On 30 June 2017 at 23:58, Erik Weathers  wrote:
> > hi Mesos folks!
> >
> > My team is largely responsible for maintaining the Storm-on-Mesos framework.
> > It suffers from a problem related to log retrieval:  Storm has a process
> > called the "logviewer" that is assumed to exist on every host, and the Storm
> > UI provides links to contact this process to download logs (and other
> > debugging artifacts).   Our team manually runs this process on each Mesos
> > host, but it would be nice to launch it automatically onto any Mesos host
> > where Storm work gets allocated. [0]
> >
> > I have read that Mesos has added support for Kubernetes-esque "pods" as of
> > version 1.1.0, but that feature seems somewhat insufficient for implementing
> > our desired behavior from my naive understanding.  Specifically, Storm only
> > has support for connecting to 1 logviewer per host, so unless pods can have
> > separate containers inside each pod [1], and also dynamically change the set
> > of executors and tasks inside of the pod [2], then I don't see how we'd be
> > able to use them.
> >
> > Is there any existing feature in Mesos that might help us accomplish our
> > goal?  Or any upcoming features?
> >
> > Thanks!!
> >
> > - Erik
> >
> > [0] Thus the "all" in quotes in the subject of this email, because it
> > *might* be all hosts, but it definitely would be all hosts where Storm gets
> > work assigned.
> >
> > [1] The Storm-on-Mesos framework leverages separate containers for each
> > topology's Supervisor and Worker processes, to provide isolation between
> > topologies.
> >
> > [2] The assignment of Storm Supervisors (a Mesos Executor) + Storm Workers
> > (a Mesos Task) onto hosts is ever changing in a given instance of a
> > Storm-on-Mesos framework.  i.e., as topologies get launched and die, or have
> > their worker processes die, the processes are dynamically distributed to the
> > various Mesos Worker hosts.  So existing containers often have more tasks
> > assigned into them (thus growing their footprint) or removed from them (thus
> > shrinking the footprint).
> -- 
> ~Kevin



Re: Mesos-Metrics per task

2017-06-29 Thread James Peach

> On Jun 29, 2017, at 3:53 PM, Thodoris Zois  wrote:
> 
> Hello, I would like to get some metrics per task, e.g. memory/cpu usage. Is
> there any way?
> 
> Thank you! 

You can use the GET_CONTAINERS agent API call 
 to get 
resource usage for a container, then match up the container to a task by using 
other master and agent APIs to match the framework ID and executor ID.
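
For example, a minimal sketch against the v1 agent operator API (the agent
host and port are placeholders):

$ curl -s -X POST http://<agent-host>:5051/api/v1 \
    -H 'Content-Type: application/json' \
    -d '{"type": "GET_CONTAINERS"}'

Each container in the response should carry fields like container_id,
framework_id, executor_id and resource statistics that you can join against
the other master and agent API results.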

J

Re: Agent Working Directory Best Practices

2017-06-26 Thread James Peach

> On Jun 26, 2017, at 4:05 PM, Steven Schlansker  
> wrote:
> 
> 
>> On Jun 25, 2017, at 11:24 PM, Benjamin Mahler  wrote:
>> 
>> As a data point, as far as I'm aware, most users are using a local work 
>> directory, not an NFS mounted one. Would love to hear from anyone on the 
>> list if they are doing this, and if there are any subtleties that should be 
>> documented.
> 
> We don't run NFS in particular but we did originally use a SAN -- two 
> observations:
> 
> NFS (historically, maybe it's better now, but doubtful...) has really bad 
> failure modes.
> Network failures can cause serious hangs both in user-space and kernel-space. 
>  Such
> hangs can be impossible to clear without rebooting the machine, and in some 
> edge cases
> can even make it difficult or impossible to reboot the machine via normal 
> means.

You need to make sure to mount with the "intr" option.

https://speakerdeck.com/gnb/130-lca2008-nfs-tuning-secrets-d7
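
For example (a sketch; the server name and paths are made up):

$ mount -t nfs -o rw,intr fileserver:/export/mesos-work /var/lib/mesos

With "intr", processes blocked on an unresponsive NFS server can be
interrupted by signals instead of hanging uninterruptibly.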

> 
> Network attached drives (our SAN) are less reliable, slower, and more complex
> (read: more failure modes) than local disk.  It's also a really big single 
> point
> of failure.  So far our only true cluster outages have been due to failure of
> the SAN, since it took down all nodes at once -- once we removed the SAN, 
> future
> failures had islands of availability and any properly written application
> could continue running (obviously without network resources) through the 
> incident.
> 
> Maybe this isn't a huge deal for your use case, which might differ from ours.
> For us, it was enough of a problem that we now purchase local SSD scratch 
> space
> for every node just so that we have some storage we can depend on a bit more
> than network attached storage.
> 
>> 
>> On Thu, Jun 22, 2017 at 11:13 PM,  wrote:
>> Hi,
>> 
>> We have a couple of server nodes mainly used for computational tasks in
>> our mesos cluster. These servers have beefy cpus, gpus etc. but only
>> limited ssd space. We also have a 40GBe network and a decently fast
>> file server.
>> 
>> My question is simple but I didnt find an answer anywhere: What are the
>> best practices for the working directory on mesos-agent nodes? Should
>> we keep the working directory local or is it reasonable to use a nfs
>> mounted folder? We implemented both and they seem to work fine, but I
>> would rather like to follow "best practices".
>> 
>> Thanks and cheers
>> 
>> Tom
>> 
> 



Re: Work group on Community

2017-06-16 Thread James Peach

> On Jun 15, 2017, at 10:57 AM, Vinod Kone  wrote:
> 
> Hi folks,
> 
> Seeing that our first official containerizer WG is off to a good start, we
> want to use that momentum to start new WGs.
> 
> I'm proposing that we start a new work group on community. The mission of
> this work group would be to figure out ways to grow the size of our
> community and improve the experience of community members (users, devs,
> contributors, committers etc).
> 
> In the first meeting, we can nail down what the charter of this work group
> should be etc. My initial ideas for the topics/components this work group
> could cover
> 
> --> Releases
> --> Roadmap
> --> Reviews
> --> JIRA
> --> CI
> 
> Over time, I'm hoping that new specific work groups will sprung up that can
> own some of these topics.
> 
> If you are interested in joining this work group, please reply to this
> thread and I'll add you to the invite.

I'm interested, but unlikely to have much bandwidth to contribute anything 
substantial. One suggestion I have is that a 'Mesos Weekly News' would be pretty 
great. There is a lot of activity on reviewboard, slack and in design documents 
and collecting that in a regular newsletter would give that activity a lot more 
visibility.

J

Re: How to filter GET_TASKS api result

2017-04-19 Thread James Peach

> On Apr 19, 2017, at 5:00 PM, Benjamin Mahler  wrote:
> 
> We can add a Call.GetTasks message to allow you to specify which task ids you 
> would like to retrieve. But this isn't supported yet, the code needs to be 
> written. E.g.
> 
> message Call {
>   enum Type {
>     GET_TASKS = 13; // Retrieves the information about tasks, see `GetTasks` below.
>   }
> 
>   message GetTasks {
>     // Which tasks to retrieve, leave empty to retrieve all tasks.
>     repeated TaskID task_ids = 1;
>   }
> }

See also https://issues.apache.org/jira/browse/MESOS-6935. It makes sense to be 
able to ask for specific FrameworkIDs too.
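
For context, the unfiltered call today looks something like this (master host
and port are placeholders); the proposal above would extend the GetTasks
message so the request body could also carry the task (and perhaps framework)
IDs to filter on:

$ curl -s -X POST http://<master-host>:5050/api/v1 \
    -H 'Content-Type: application/json' \
    -d '{"type": "GET_TASKS"}'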

> 
> On Thu, Apr 6, 2017 at 8:31 PM, 梦开始的地方 <382607...@qq.com> wrote:
> 
> but Spark and Chronos have too many short tasks; getting all tasks is too slow.
> 
> -- Original Message --
> From: "Alexander Rojas";;
> Sent: Monday, April 3, 2017, 9:47 PM
> To: "user";
> Subject: Re: How to filter GET_TASKS api result
> 
> Hi,
> 
> Mesos does not have a way to get info about a single task, however the answer 
> should be pretty easy to filter so you can search for the task you’re looking 
> for.
> 
> Alexander Rojas
> alexan...@mesosphere.io
> 
> 
> 
> 
>> On 20 Mar 2017, at 10:35, 梦开始的地方 <382607...@qq.com> wrote:
>> 
>> Hi, I'd like to use the GET_TASKS API to get a specific task, but the API
>> returns all tasks.
>> Please help me, thanks
>> 
> 
> 



Re: Structured logging for Mesos (or c++ glog)

2016-12-19 Thread James Peach

> On Dec 19, 2016, at 2:54 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> 
> Hi James,
> 
> Stitching events together is only one possible use case, and I'm not exactly 
> sure what you meant by directly event logging.
> 
> Taking the hierarchical allocator for example. In a multi-framework cluster, 
> sometimes I want to comb through various loggings and present a trace on how 
> allocation has affected a particular framework (by its framework id) and/or 
> w.r.t an agent (by its agent id).
> 
> Being able to systematically extract structured field values like 
> framework_id or agent_id automatically from all logs, regardless of the actual 
> logging pattern, will be tremendously valuable in such use cases.

I think we are talking about similar things. Many servers do both free-form 
error logging and structured event logging. I'm thinking of event logging 
formats that are customizable by the operator and allow the interpolation of 
context-specific data items (eg. HTTP access logs from many different server 
implementations).

J

Marathon API and support for pods

2016-09-21 Thread James DeFelice
Hi folks,

First of all the Marathon team would like to thank those who provided
feedback on the v3 API proposal (linked below) that was circulated last
month. Developing a new API for Marathon is a big undertaking and getting
your feedback early in the process has been helpful.

"A vision for pods in Marathon
[
https://docs.google.com/document/d/1uPH58NWN_OuynptsqTOq8v5qlkivq2mUb2M9J5jZTMU/edit#heading=h.sqydeepp9s4m
]

We've done some additional discovery over the past several weeks and have made
some changes to our API roadmap. It takes time to get an API right and the
decisions we make now will have lasting impact for months/years to come;
let's ensure that we spend enough time now thinking through the long-term
implications of particular API design choices. We'd also like to facilitate
a deeper discussion with the community about what problems v3 should solve
before committing to API decisions. The current plan is to resume work on
the v3 API later this fall. Stay tuned for additional announcements
regarding v3 API proposals.

Furthermore, the v3 API is about more than just pods. We, the team and
community at large, should ensure that a new API is conceptually consistent
across API types and satisfies not only our short-term goals, but is
forward-compatible with our long term roadmap.

That said, we have customer demand for pods now! In order to deliver pods
functionality without committing to a v3 API the team has decided to
introduce pods via a new /v2/pods API endpoint. What does this mean for v2
API users?

First, if your organization doesn't have an interest in pods then nothing
forces you to change. Additional support for pods in the Marathon v2 API
should not cause breakage for existing v2 API users.

Second, by integrating with the existing v2 API and backend we'll be
minimizing overall architectural changes. Needless to say there will be
changes to backend components that had been previously optimized for single
tasks vs. task groups. But the overall architecture of the system will not
change. This is important in order to preserve the performance,
scalability, and stability gains that Marathon has recently made.

In addition, introducing pods in v2 allows the Marathon team to gather
early feedback from the community and our customers about how the API does
and does not meet their needs. This is very valuable input that will help
to shape the future v3 API.

Below is a link to a proposal for pods in the Marathon v2 API. This initial
implementation for pods support should be viewed as an MVP that will be
enhanced in coming releases. Your feedback is most welcome and strongly
encouraged. Please comment directly in the document with any questions or
concerns.

"Marathon: Pods in v2"
[
https://docs.google.com/document/d/1Zno6QK2yGF4koB8BYT88EtB2-T7t3aAYRQ27pUD76gw/edit#heading=h.ywxj299mstr7
]

Many thanks,

the Marathon team.


A vision for pods in Marathon

2016-08-13 Thread James DeFelice
Hi folks,

Hot on the heels of the Mesos announcement of support for "task groups" the
Marathon team would like to share an API road map for pods in Marathon
(google-doc linked at bottom).

We are requesting feedback from the community about both our approach to
pods and, more generally, the style of the next, v3 API. It will likely
take some time to fully specify a v3 API and the attached document
represents only a fraction of what such an API might look like. We've
mocked out a few possible REST API endpoints (specific to pods), but this
document is very much a Work In Progress.

We want your input! We deeply value contributions from our community and
look forward to hearing from you.

https://docs.google.com/document/d/1uPH58NWN_OuynptsqTOq8v5qlkivq2mUb2M9J5jZTMU/edit#


Re: Persistent volume ownership issue

2016-06-21 Thread James Peach
On 21 June 2016 at 12:25, Jie Yu <yujie@gmail.com> wrote:
> James, sticky bit means that there will be no write sharing between two
> users even if the underlying permission allows it. I'd prefer not having
> this restriction:)

No, it just prevents users from renaming or deleting each other's files.

http://man7.org/linux/man-pages/man1/chmod.1.html

If you want multiple users to be able to write to the same files, they
need to create them with the right ownership.

>> I wonder whether ACLs are the right solution to volume ownership?
>> Certainly I think inherited ACLs are a good solution for expressing a
>> consistent access control policy over a hierarchy (at least in the
>> Windows/Darwin/SMB/NFSv4/RichAcl ACL model).
>
>
> Are you suggesting that we don't expose the underlying unix user directly to
> frameworks. Instead, expressing permissions and ownerships using ACLs?

Well that could be an option, though I'm mainly thinking out loud.
With shared volumes, it seems like you really want an access control
policy that applies to the volume, rather than requiring processes to
collaborate at a file granularity. One way to do that would be to make
the owner the creator of the volume, then use ACL inheritance to grant
additional access to other users. You'd have to reflow the
inheritance, but it could probably be done.

-- 
James Peach | jor...@gmail.com


Re: Persistent volume ownership issue

2016-06-21 Thread James Peach
Non-recursive chown is an improvement over recursive chown which seems
fraught and should be avoided. For an interim fix, could you make the
volume root world writeable with the sticky bit set? Then you wouldn't
have to chown and volume users would still be able to create files.
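
As a concrete sketch of that interim fix (the path is a placeholder):

$ chmod 1777 /path/to/volume
$ ls -ld /path/to/volume
drwxrwxrwt ...

Mode 1777 leaves the directory world-writable, while the sticky bit restricts
renaming or deleting entries to their owners (and root).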

I wonder whether ACLs are the right solution to volume ownership?
Certainly I think inherited ACLs are a good solution for expressing a
consistent access control policy over a hierarchy (at least in the
Windows/Darwin/SMB/NFSv4/RichAcl ACL model).

On 20 June 2016 at 23:25, Jie Yu <yujie@gmail.com> wrote:
> Hi folks,
>
> Currently, the ownership of the persistent volumes are set to be the same as
> the sandbox. In the implementation, we call `chown -R` on the persistent
> volume to match that of the sandbox each time before we mount it into the
> container.
>
> Recently, we realized that this behavior is not ideal. Especially, if a task
> created some files in the persistent volume, and the owner of those files
> might be different than the task's user. For instance, a task is running
> under root and it creates some database files under user 'database' and
> launch the database process under user 'database'. When the database process
> is restarted by the scheduler, the current behavior is that the we'll do a
> 'chown -R root.root' on the persistent volume, causes database files to be
> chown to 'root'.
>
> The true fix for this problem is to allow frameworks to explicitly specify the
> owner of persistent volumes during creation. This is captured in this
> ticket:
> https://issues.apache.org/jira/browse/MESOS-4893
>
> In the short-term (for 1.0), I propose that, instead of doing a recursive
> chown, we do a non-recursive chown. That'll allow the new task to at least
> create new files under the persistent volume, but do not change ownership of
> files created by previous tasks. It should be a very simple fix which we can
> ship in 1.0. We'll ship MESOS-4893 after 1.0. What do you guys think?
>
> Thanks,
> - Jie



-- 
James Peach | jor...@gmail.com


Re: New external dependency

2016-06-18 Thread james

Hello Kevin,

On gentoo, I have this version::

dev-libs/libelf-0.8.13-r2

Which looks to be a bit newer, due to the rolling release updates
of gentoo. It should not matter if a new version is used, right?
Version 0.8.9 (stable), released on 22 August 2006, seems to be common 
among binary distros.



Note, Version 0.8.12 and later work with some of the advanced Kernel 
performance tools..


hth,
James




On 06/18/2016 12:25 PM, Kevin Klues wrote:

Hello all,

Just an FYI that the newest libmesos now has an external dependence on
libelf on Linux. This dependence can be installed via the following
packages:

CentOS 6/7: yum install elfutils-libelf.x86_64
Ubuntu14.04:   apt-get install libelf1

Alternatively you can install from source:
https://directory.fsf.org/wiki/Libelf

For developers, you will also need to install the libelf headers in
order to build master. This dependency can be installed via:

CentOS: elfutils-libelf-devel.x86_64
Ubuntu: libelf-dev

Alternatively, you can install from source:
https://directory.fsf.org/wiki/Libelf

The getting started guide and the support/docker_build.sh scripts have
been updated appropriately, but you may need to update your local
environment if you don't yet have these packages installed.





Re: Rack awareness support for Mesos

2016-06-15 Thread james

@Joris,


OK. Now I understand where you are coming from. As soon as I get some 
time, I'll join that design discussion. Thanks for the clarifications.


James





On 06/15/2016 02:45 AM, Joris Van Remoortere wrote:

Since your interest is in the determination of the values, as opposed to
their propagation, I would just urge that you keep in mind that we may
(as a project) not want to support this information as the current
string attributes.


Huh? Why not? If the attributes change, why can't this sub-project
just change with those changing string attributes? Maybe some
elaboration how this might not naturally be able to evolve is a
warranted detail of discussion?


Sorry, I should clarify what I meant by support. By support I mean that
we may not want to promise that those values will be there (support as a
feature), and what schemas are mangled into the random strings that we
currently call attributes. I did not mean that we wouldn't allow users
to inject their own values if they wanted to. We just wouldn't control
the standard or schema as a project and therefore couldn't support it.

Any random collection of strings that has previously had no reserved
keywords is notoriously difficult to build new schemas in.
This is why we may want to instead introduce a typed structure that is
dedicated to fault domain information. This:

  * Prevents us from colliding with current users' attributes.
  * Allows us to have more control over the types (YAY) and ranges of
values.
  * Allows us to introduce explicit structure such as dependency or
hierarchy.

The fact that users have already encoded information in attributes is
not a reason for us to limit ourselves to that scope when better
structures may be available. This is why we shouldn't assume that the
project will *provide support for* (as opposed to allow users to) using
attributes.

As your said, it is their prerogative to join the design discussion to
ensure that any formalized structure or schema we introduce is one that
they are agreeable with.



—
*Joris Van Remoortere*
Mesosphere

On Tue, Jun 14, 2016 at 6:31 PM, james <gar...@verizon.net> wrote:

On 06/14/2016 08:14 AM, Joris Van Remoortere wrote:

On the condition of compatibility with existing frameworks which
already rely on parsing attributes for rack information.

There is currently nothing in Mesos that specifies the format or
structure for rack information in attributes.
The fact that operators / frameworks have decided to add this
information out of band is their problem to solve.
We don't need to be backwards compatible with something we never
published to begin with. This is why it's ok for us to consider
adding a
typed form of failure domain information that is separate from the
typeless string attributes.


True. But you have to start somewhere, know that the schema and
codes will morph over time to maintain relevance  and usefulness. In
that vein, if folks have established interesting and useful
parameters for this work, then it is most beneficial that those
methods and codes are considered carefully.  AKA:: speak up now.
Diversity and inclusion are keenly beneficial, where practical.


Since your interest is in the determination of the values, as
opposed to
their propagation, I would just urge that you keep in mind that
we may
(as a project) not want to support this information as the current
string attributes.


Huh? Why not? If the attributes change, why can't this sub-project
just change with those changing string attributes? Maybe some
elaboration how this might not naturally be able to evolve is a
warranted detail of discussion?


I would venture that both 'determination of the values and
propagation (delays)' are inherently important in a cluster of many
things:: hardware, resources, frameworks, security codes, etc etc.
The author
and others seem to be keenly aware that a tight focus is not going
to work, at this stage, so a broad appeal to a multitude of needs is
best.
And in fact, until some idea is proven to be useless or too difficult to
implement, the bigger the tent, the more useful the codes that
define this project/idea become.  Personally, I'm very excited that
someone has stepped up in this area; hoping they keep an open mind
and flexibility geared toward multiplicative usage, in the future.
Most mature hardware folks who build ideas into robust systems do
exactly that, to motivate a multiplicative usage for organizing
hardware, performance and state metrics, and timing signals,
gregariously. All of this is routine semantics from a hardware
perspective.

At some point, folks will realize that kernel c

Re: Rack awareness support for Mesos

2016-06-14 Thread james

On 06/14/2016 08:14 AM, Joris Van Remoortere wrote:

On the condition of compatibility with existing frameworks which already rely on 
parsing attributes for rack information.

There is currently nothing in Mesos that specifies the format or
structure for rack information in attributes.
The fact that operators / frameworks have decided to add this
information out of band is their problem to solve.
We don't need to be backwards compatible with something we never
published to begin with. This is why it's ok for us to consider adding a
typed form of failure domain information that is separate from the
typeless string attributes.


True. But you have to start somewhere, know that the schema and codes 
will morph over time to maintain relevance  and usefulness. In that 
vein, if folks have established interesting and useful parameters for 
this work, then it is most beneficial that those methods and codes are 
considered carefully.  AKA:: speak up now. Diversity and inclusion are 
keenly beneficial, where practical.




Since your interest is in the determination of the values, as opposed to
their propagation, I would just urge that you keep in mind that we may
(as a project) not want to support this information as the current
string attributes.


Huh? Why not? If the attributes change, why can't this sub-project just 
change with those changing string attributes? Maybe some elaboration how 
this might not naturally be able to evolve is a warranted detail of 
discussion?



I would venture that both 'determination of the values and propagation 
(delays)' are inherently important in a cluster of many things:: 
hardware, resources, frameworks, security codes, etc etc. The author
and others seem to be keenly aware that a tight focus is not going to 
work, at this stage, so a broad appeal to a multitude of needs is best.

And in fact, until some idea is proven to be useless or too difficult to
implement, the bigger the tent, the more useful the codes that define 
this project/idea become.  Personally, I'm very excited that someone has 
stepped up in this area; hoping they keep an open mind and flexibility 
geared toward multiplicative usage, in the future. Most mature hardware 
folks who build ideas into robust systems do exactly that, to motivate a 
multiplicative usage for organizing hardware, performance and state 
metrics, and timing signals, gregariously. All of this is routine 
semantics from a hardware perspective.


At some point, folks will realize that kernel configuration, testing and 
tweaks are critical to cluster performance, regardless of the codes

running on top of the cluster. So this project could easily use cgroups
and such to achieve robustness in many areas of need.


Like it or not, large amounts of hardware need to have schema, planning 
and architectural robustness to keep large amounts of hardware 
pristinely available for software efficiency to be anywhere near 
optimal deployment. This really becomes critical when the mix of 
different CPU types, GPUs and ram are to be considered in future 
deployments, regardless if you outsource or run your own cluster. 
Hardware vendors are going to want to sell their products to as wide of 
a customer base a possible and customers are going to demand seamless 
management for expansion of resources. Furthermore, as a consultant my 
experiences are that much of the future market is going to demand 
outsourced, hybrid and in-house options as a fundamental tenant of 
cluster resource adoption.


hth,
James



*Joris Van Remoortere*
Mesosphere

On Tue, Jun 14, 2016 at 3:02 PM, Du, Fan <fan...@intel.com> wrote:



On 2016/6/14 20:32, Joris Van Remoortere wrote:

 #1. Stick with attributes for rack awareness

I don't think this is the right approach; however, there seem to be 2
components to this discussion:

1. How the values are presented (Attributes vs. a new type-aware structure)
2. How the values are determined (scripts vs. automation vs. modules)

It seems you are more interested in working on #2. If that's the case,
please make sure that you don't assume anything about #1, as not
everyone agrees that we will use the existing attributes in the future.


On the condition of compatibility with existing frameworks which already
rely on parsing attributes for rack information.

Quotes from my original statements:
> For compatibility with existing framework, I tend to be ok with using
> attributes to convey the rack information

By all means, no matter what internal structures to use, current
behavior should be honored. btw, I'm also thinking about #1, it's
too early to bring up the details so far before the ticket got
ACCEPTED.

Any way, I'm always open to all kind of discussion, thanks for your
comments! Joris.

For #2, you should focus on an API

Re: Mesos 0.24.1 on Raspberry Pi 3

2016-06-07 Thread james

@ Spyker
I found these $40/each arm64v8 boards, with 2 G of ram each. It would be 
keen if the many folks interested in mesos on arm64v8 could agree at 
least on a low cost dev board to work on for mesos development, imho.

I'd really like embedded arm64v8 boards with 4 gig of ram.

https://archlinuxarm.org/platforms/armv8/amlogic/odroid-c2

http://www.cnx-software.com/2016/02/29/odroid-c2-64-bit-arm-development-board-is-now-available-for-purchase-for-40/

@ Joris
This is all awesome news. Sure there is a time for systemd. But, for 
now, bare metal development, rDMA/compiler issues, and arm64v8 are 
enough issues to work on with mesos for HPC. Surely it has me 
overwhelmed because dual work on x86_64 and the GPU issues mandates that 
I work on 2 different platforms. (thanks).


On 06/07/2016 04:15 PM, Andrew Spyker wrote:

FYI ... We were not able to compile latest master or 0.28.1.  What we
saw was that the linking step ran out of memory - well beyond the 1G of
physical and 1G of swap.  We considered some linking options to trade
off memory, but haven't tried them.





On Tue, Jun 7, 2016 at 12:18 PM, Joris Van Remoortere
<jo...@mesosphere.io> wrote:

All versions of mesos *should* work without systemd. The intent was
to add *support* for systemd, not make it a requirement.
If specific versions of mesos *don't* work without systemd then that
is a bug, and it would be awesome if you could share specific issues
(we can make JIRAs).

The purpose of the `systemd_enable_support` flag was to prevent mesos
from thinking it should use systemd utilities when systemd was
available on the system (and therefore Mesos assumes it's being
launched as a systemd unit).
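
For example, on an agent you can opt out explicitly with something like this
(a sketch; the other flags are placeholders, and on releases older than 1.0
the binary is mesos-slave):

$ mesos-agent --master=<master-host>:5050 --work_dir=/var/lib/mesos --no-systemd_enable_support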

I want to make it very clear that there is no intent to make systemd
a requirement :-) We would need to have a significant conversation
in the community first if that were the case.






I really enjoyed hearing this progress, so please do ping me on any
JIRAs where systemd made this project more difficult!

Joris

—
*Joris Van Remoortere*
Mesosphere

On Tue, Jun 7, 2016 at 1:01 PM, james <gar...@verizon.net> wrote:

Just the opposite, I'm mostly interested in mesos without systemd
on bare metal, minimized linux systems. So with that temporal
requirement, what is the latest version of mesos that one can run
without systemd?

James


On 06/07/2016 10:35 AM, Joris Van Remoortere wrote:

It should be straightforward to apply the patch that adds the
`systemd_enable_support` flag to older releases.
Let me know if you need help!

—
*Joris Van Remoortere*
Mesosphere

 On Tue, Jun 7, 2016 at 11:28 AM, haosdent <haosd...@gmail.com> wrote:

 No, it is mandatory in 0.25. `systemd_enable_support` is added since
 0.27 https://issues.apache.org/jira/browse/MESOS-4675

 On Tue, Jun 7, 2016 at 11:21 PM, Jan Schlicht <j...@mesosphere.io> wrote:

 It's not mandatory. There's the `systemd_enable_support` flag to
 enable some systemd related features on an agent but it can be
 disabled.

 Cheers,
 Jan

 On Tue, Jun 7, 2016 at 3:55 PM, james <gar...@verizon.net> wrote:

 I thought systemd was not mandatory in version 0.25 and later?

 James

 On 06/07/2016 07:42 AM, tommy xiao wrote:

 only 0.24 can work on it. 0.25 use systemd and can't ignore it.

 2016-06-07 7:50 GMT+08:00 Benjamin Mahler <bmah...@apache.org>:

  Cool stuff Andrew, thanks for sharing!

  On Thu, Jun 2, 2016 at 11:50 AM, Andrew Spyker
  <aspy...@netf

Re: Mesos 0.24.1 on Raspberry Pi 3

2016-06-07 Thread james

Just the opposite, I'm mostly interested in mesos without systemd
on bare metal, minimized linux systems. So with that temporal 
requirement, what is the latest version of mesos that one can run

without systemd?

James


On 06/07/2016 10:35 AM, Joris Van Remoortere wrote:

It should be straightforward to apply the patch that adds the
`systemd_enable_support` flag to older releases.
Let me know if you need help!

—
*Joris Van Remoortere*
Mesosphere

On Tue, Jun 7, 2016 at 11:28 AM, haosdent <haosd...@gmail.com> wrote:

No, it is mandatory in 0.25. `systemd_enable_support` is added since
0.27 https://issues.apache.org/jira/browse/MESOS-4675

On Tue, Jun 7, 2016 at 11:21 PM, Jan Schlicht <j...@mesosphere.io> wrote:

It's not mandatory. There's the `systemd_enable_support` flag to
enable some systemd related features on an agent but it can be
disabled.

Cheers,
Jan

On Tue, Jun 7, 2016 at 3:55 PM, james <gar...@verizon.net> wrote:


I thought systemd was not mandatory in version 0.25 and later?

James


On 06/07/2016 07:42 AM, tommy xiao wrote:

only 0.24 can work on it. 0.25 use systemd and can't
ignore it.

2016-06-07 7:50 GMT+08:00 Benjamin Mahler
<bmah...@apache.org>:

 Cool stuff Andrew, thanks for sharing!

 On Thu, Jun 2, 2016 at 11:50 AM, Andrew Spyker
 <aspy...@netflix.com.invalid>
 wrote:

  > FYI, based on the work others have done in the past, Netflix was
  > able to get Mesos agent building and running on Raspberry Pi
  > natively and under Docker containers.  Please see this blog for the
  > information:
  >
  > bit.ly/TitusOnPi
  >
  > --
  > Andrew Spyker (aspy...@netflix.com)
  > Twitter:  @aspyker  Blog: ispyker.blogspot.com
  >




--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com





--
*Jan Schlicht*
Distributed Systems Engineer, Mesosphere




--
Best Regards,
Haosdent Huang






Re: Rack awareness support for Mesos

2016-06-07 Thread james

On 06/07/2016 09:57 AM, Du, Fan wrote:



On 2016/6/6 21:27, james wrote:

Hello,


@Stephen:: I guess Stephen is bringing up the 'security' aspect of who
gets access to the information, particularly cluster/cloud devops,
customers or interlopers?


ACLs should play a part here to address the security concern.


YES, and so much more! I know folks whose primary (in-house
cluster) usage is deep packet inspection on the cluster.

With an internal cluster there is no limit to the new tools that can be
judiciously adapted to benefit from cluster codes.





@Fan:: As a consultant, most of my customers either have or are
planning hybrid installations, where some codes run on a local cluster
or use 'the cloud' for dynamic load requirements. I would think your
proposed scheme needs to be very flexible, both in application to a
campus or Metropolitan Area Network, if not massively distributed around
the globe. What about different resource types (racks of arm64,
gpu-centric hardware, DSPs, FPGAs, etc.)? Hardware diversity brings many
benefits to the cluster/cloud capabilities.
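
For what it's worth, the closest existing hook for this today is agent
attributes: rack, DC or hardware type can be advertised at agent
start-up and matched by frameworks from the attributes carried in each
offer. A sketch, with made-up attribute names and values:

    mesos-slave --master=zk://zkhost:2181/mesos \
        --attributes='rack:r42;dc:lon1;arch:arm64'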


This also raises the question of hardware management (boot/config/online)
of the various hardware, such as is built into coreOS. Are several
applications going to be supported? Standards track? Just Mesos DC/OS
centric?


It depends on whether this proposal is accepted by Mesos. If you think
this feature is useful, let's discuss detailed requirements under
MESOS-5545.


OK. Take a look at 'Rackview' on sourceforge::
'http://rackview.sourceforge.net/'


Do I have access to the jira system by default by joining this list,
or do I have to request permission somewhere? (Sorry, jira is new to me,
so a pointer to a document on how Mesos uses jira would be keen.)



btw, I have limited knowledge of CoreOS, will look into it.


CoreOS has some great ideas. But many of their codes are not current
(when compared to the gentoo portage tree) and thus many are suspect
for security/function.

I thought the purpose was to get more folks involved here in discussions,
and then better-formulated ideas can migrate to the ticket (5545) and
the repos.






TIMING DATA:: This is the main issue I see. Once you start 'vectoring
in resources' you need to add timing (latency) data to encourage robust
and diversified use of this data. For HPC, this could be very
valuable for RDMA-heavy algorithms, where memory-constrained workloads
not only need the knowledge of additional nearby memory resources, but
also the approximated (based on previously collected data) latency and
bandwidth constraints to use those additional resources.


Out of curiosity, with which open-sourced Mesos framework do you/your
customers run MPI?


Easy dude. Most of this work is tightly held and there is nothing to
publish or open up yet. It's a mess (my professional opinion) right now
and I'm testing a variety of tools just to be able to have better
instrumentation on these codes. Still, RDMA is very attractive, so it
does warrant much attention and extreme, internal, excitement.






Mesos can support an MPI framework, but AFAIK it's immature [1][2].


YEP.


I think this part of the work should be investigated in the future.

[1]: https://github.com/apache/mesos/tree/master/mpi <- mpd ring version
[2]: https://github.com/mesosphere/mesos-hydra <- hydra version


Many codes floating around. Much excitement on new compiler features.
Lots of hard work and testing going on. That said, the point I was trying
to make is that "vectoring in" resources, with a variety of parameters as
a companion to your idea, is warranted for these aforementioned use cases
and other opportunities.




Great idea. I do like it very much.

hth,
James


On 06/06/2016 05:06 AM, Stephen Gran wrote:

Hi,

This looks potentially interesting.  How does it work in a public cloud
deployment scenario?  I assume you would just have to disable this
feature, or not enable it?

Cheers,

On 06/06/16 10:17, Du, Fan wrote:

Hi, Mesos folks

I’ve been thinking about Mesos rack awareness support for a while,

it’s a common interest for lots of data center applications to provide
data locality,

fault tolerance and better task placement. Created MESOS-5545 to track
the story,

and here is the initial design doc [1] to support rack awareness in
Mesos.

Looking forward to hearing any comments from end users and other
developers,

Thanks!

[1]:
https://docs.google.com/document/d/1rql_LZSwtQzBPALnk0qCLsmxcT3-zB7X7aJp-H3xxyE/edit?usp=sharing















Re: Mesos 0.24.1 on Raspberry Pi 3

2016-06-07 Thread james


I thought systemd was not mandatory in version 0.25 and later?

James


On 06/07/2016 07:42 AM, tommy xiao wrote:

only 0.24 can work on it. 0.25 uses systemd and can't ignore it.

2016-06-07 7:50 GMT+08:00 Benjamin Mahler <bmah...@apache.org>:

Cool stuff Andrew, thanks for sharing!

On Thu, Jun 2, 2016 at 11:50 AM, Andrew Spyker
<aspy...@netflix.com.invalid>
wrote:

 > FYI, based on the work others have done in the past, Netflix was
 > able to get Mesos agent building and running on Raspberry Pi
 > natively and under Docker containers.  Please see this blog for
 > the information:
 >
 > bit.ly/TitusOnPi
 >
 > --
 > Andrew Spyker (aspy...@netflix.com)
 > Twitter:  @aspyker  Blog: ispyker.blogspot.com
 >




--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com




Re: Rack awareness support for Mesos

2016-06-06 Thread james

Hello,


@Stephen:: I guess Stephen is bringing up the 'security' aspect of who
gets access to the information, particularly cluster/cloud devops,
customers or interlopers?



@Fan:: As a consultant, most of my customers either have or are
planning hybrid installations, where some codes run on a local cluster
or use 'the cloud' for dynamic load requirements. I would think your
proposed scheme needs to be very flexible, both in application to a
campus or Metropolitan Area Network, if not massively distributed around
the globe. What about different resource types (racks of arm64,
gpu-centric hardware, DSPs, FPGAs, etc.)? Hardware diversity brings many
benefits to the cluster/cloud capabilities.


This also raises the question of hardware management (boot/config/online)
of the various hardware, such as is built into coreOS. Are several 
applications going to be supported? Standards track? Just Mesos DC/OS

centric?


TIMING DATA:: This is the main issue I see. Once you start 'vectoring
in resources' you need to add timing (latency) data to encourage robust
and diversified use of this data. For HPC, this could be very
valuable for RDMA-heavy algorithms, where memory-constrained workloads
not only need the knowledge of additional nearby memory resources, but
also the approximated (based on previously collected data) latency and
bandwidth constraints to use those additional resources.



Great idea. I do like it very much.

hth,
James


On 06/06/2016 05:06 AM, Stephen Gran wrote:

Hi,

This looks potentially interesting.  How does it work in a public cloud
deployment scenario?  I assume you would just have to disable this
feature, or not enable it?

Cheers,

On 06/06/16 10:17, Du, Fan wrote:

Hi, Mesos folks

I’ve been thinking about Mesos rack awareness support for a while,

it’s a common interest for lots of data center applications to provide
data locality,

fault tolerance and better task placement. Created MESOS-5545 to track
the story,

and here is the initial design doc [1] to support rack awareness in Mesos.

Looking forward to hearing any comments from end users and other developers,

Thanks!

[1]:
https://docs.google.com/document/d/1rql_LZSwtQzBPALnk0qCLsmxcT3-zB7X7aJp-H3xxyE/edit?usp=sharing







Re: OSX 10.10.5 and mesos 0.28.1 -- 10 to 20 X difference in sleep() method compared to non mesos

2016-06-04 Thread James Mulcahy

Hi Rinaldo,

MacOS X has a variety of mechanisms designed to improve energy efficiency, and
many of these impact timer behavior. I suspect that this is what is affecting
you. There is a whitepaper here which has more details:
https://www.apple.com/media/us/osx/2013/docs/OSX_Power_Efficiency_Technology_Overview.pdf
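
If you want to test that theory quickly, timer coalescing can be
inspected and switched off for a benchmark run. This is a sketch,
assuming the kern.timer.coalescing_enabled sysctl is still present on
10.10:

    # check the current setting, then disable coalescing (needs root)
    sysctl kern.timer.coalescing_enabled
    sudo sysctl -w kern.timer.coalescing_enabled=0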

—James

> On Jun 3, 2016, at 13:11, DiGiorgio, Mr. Rinaldo S. <rdigior...@pace.edu> 
> wrote:
> 
> Hi,
> 
>   We are running the following Java application and we are getting 
> unreasonable deltas in the actual amount of time slept. On linux the results are 
> as expected 10, 11, 12 but mostly 10ms.  Can you suggest any changes we can 
> make or is this a known issue or a new issue to be investigated? When we run 
> the same code on the same instance of OSX 10.10.5 without mesos  -- we get 
> the expected results. 
> 
> 
> public class SleepLatency {
>   static final int COUNT = 100;
>   static final long DELAY = 10L;
> 
>   public static void main(String[] args) throws Exception {
>   long tstart = System.currentTimeMillis();
>   for (int i = 0; i < COUNT; i++) {
>   long t0 = System.currentTimeMillis();
>   Thread.sleep(DELAY);
>   long t1 = System.currentTimeMillis();
>   System.out.printf("loop %3d delay %4d ms%n", i, t1 - t0);
>   }
>   long tfinish = System.currentTimeMillis();
>   System.out.printf("total time = %5d ms%n", tfinish - tstart);
>   }
> }
> 
> == OSX   RESULTS are 10 to 20 times  larger than LINUX Results below =
> 
> sh -c '/opt/jdk/bin/java -cp ./mach5-mesos-support-1.0-SNAPSHOT.jar 
> SleepLatency'
> loop   0 delay  141 ms
> loop   1 delay  201 ms
> loop   2 delay   81 ms
> loop   3 delay   14 ms
> loop   4 delay  194 ms
> loop   5 delay  149 ms
> loop   6 delay  172 ms
> loop   7 delay  203 ms
> loop   8 delay  203 ms
> loop   9 delay  204 ms
> loop  10 delay  204 ms
> loop  11 delay  204 ms
> loop  12 delay  203 ms
> loop  13 delay  203 ms
> loop  14 delay   40 ms
> loop  15 delay  206 ms
> loop  16 delay  171 ms
> loop  17 delay  107 ms
> loop  18 delay   85 ms
> loop  19 delay  204 ms
> loop  20 delay  204 ms
> loop  21 delay  203 ms
> loop  22 delay  208 ms
> loop  23 delay  200 ms
> loop  24 delay  203 ms
> loop  25 delay  203 ms
> loop  26 delay  204 ms
> loop  27 delay  204 ms
> loop  28 delay  120 ms
> loop  29 delay   83 ms
> loop  30 delay  204 ms
> loop  31 delay  203 ms
> loop  32 delay  204 ms
> loop  33 delay  208 ms
> loop  34 delay  199 ms
> loop  35 delay  204 ms
> loop  36 delay  175 ms
> loop  37 delay   11 ms
> loop  38 delay  115 ms
> loop  39 delay  205 ms
> loop  40 delay  204 ms
> loop  41 delay   11 ms
> loop  42 delay   91 ms
> loop  43 delay  202 ms
> loop  44 delay  203 ms
> loop  45 delay  204 ms
> loop  46 delay  209 ms
> loop  47 delay  112 ms
> loop  48 delay   16 ms
> loop  49 delay   69 ms
> loop  50 delay  204 ms
> loop  51 delay   18 ms
> loop  52 delay   14 ms
> loop  53 delay   70 ms
> loop  54 delay   33 ms
> loop  55 delay  184 ms
> loop  56 delay  199 ms
> loop  57 delay  194 ms
> loop  58 delay  102 ms
> loop  59 delay  102 ms
> loop  60 delay   12 ms
> loop  61 delay  197 ms
> loop  62 delay  204 ms
> loop  63 delay  204 ms
> loop  64 delay  206 ms
> loop  65 delay   11 ms
> loop  66 delay  180 ms
> loop  67 delay  202 ms
> loop  68 delay   10 ms
> loop  69 delay   20 ms
> loop  70 delay  199 ms
> loop  71 delay  179 ms
> loop  72 delay  202 ms
> loop  73 delay   33 ms
> loop  74 delay   69 ms
> loop  75 delay   14 ms
> loop  76 delay   88 ms
> loop  77 delay  204 ms
> loop  78 delay  209 ms
> loop  79 delay  198 ms
> loop  80 delay  204 ms
> loop  81 delay   25 ms
> loop  82 delay   76 ms
> loop  83 delay  102 ms
> loop  84 delay  173 ms
> loop  85 delay   13 ms
> loop  86 delay   17 ms
> loop  87 delay   14 ms
> loop  88 delay  191 ms
> loop  89 delay  204 ms
> loop  90 delay  204 ms
> loop  91 delay  102 ms
> loop  92 delay   47 ms
> loop  93 delay   37 ms
> loop  94 delay  142 ms
> loop  95 delay  202 ms
> loop  96 delay  204 ms
> loop  97 delay  202 ms
> loop  98 delay  104 ms
> loop  99 delay   80 ms
> total time = 14193 ms
> 
> 
> == LINUX   RESULTS are as expected ==
> 
> sh -c '/opt/jdk/bin/java -cp ./mach5-mesos-support-1.0-SNAPSHOT.jar 
> SleepLatency'
> Forked command at 6125
> loop   0 delay   10 ms
> loop   1 delay   11 ms
> loop   2 delay   10 ms
> loop   3 delay   10 ms
> loop   4 delay   10 ms
> loop   5 delay   10 ms
> loop   6 delay   10 ms
> loop   7 delay   1

Re: 1.0 Release Candidate

2016-05-25 Thread james

On 05/25/2016 10:52 AM, Vinod Kone wrote:

Hi folks,

As discussed in the previous community sync, we plan to cut a release
candidate for our next release (1.0) early next week.

1.0 is mainly centered around new APIs for Mesos. Please take a look at
MESOS-338 <https://issues.apache.org/jira/browse/MESOS-338> for blocking
issues. We got some great design and testing feedback for the v1
scheduler and executor APIs. Please do the same for the in-progress v1
operator API
<https://docs.google.com/document/d/1XfgF4jDXZDVIEWQPx6Y4glgeTTswAAxw6j8dPDAtoeI/edit?pref=2=1#>.

Since this is a 1.0, we would like to do the release a little differently.

First, the voting period for vetting the release candidate would be a
few weeks (2-3 weeks) instead of the typical 3 days.

Second, we are willing to make major changes (scalability fixes, API
fixes) if there are any issues reported by the community.

We are doing these because we really want the community to thoroughly
test the 1.0 release and give feedback.




Exciting announcement! I have not updated the gentoo ebuilds (packages)
for some time. Perhaps some pre-releases of 1.0, so I can work through
all the raw compilation issues, at least for a 100% source build for
gentoo, as the community and devs work on the rest of the stabilization
issues? A pre-release would mostly be to work through builds from 100%
source, so it would not matter how things change on the pathway to the
official 1.0 release.


This work would likely benefit platforms built up from 100% sources,
and it would therefore be an excellent effort if anyone wanted to do the
same on a linux distro other than gentoo, or on a new hardware platform,
such as arm64v8. I would be willing to work on the arm64v8 source
builds, with anyone interested, as I'm currently looking for a
low-cost virtual host to work on those 64-bit arm source builds; speak
up if anyone has a recommendation for hosted arm64v8 hardware.




Additionally, if others are interested in Mesos for HPC, now would be an 
excellent time to bring forth any bugs or issues related to using mesos 
for HPC (low latency) clusters.



curiously,
James






Re: How is the OS X environment created with Mesos

2016-05-18 Thread James Peach
This probably boils down to not being in the right launchd session.
launchd(8) discusses this at a high level. You can see what is going
on in your user session with "launchctl print user/$(id -u)".

I'm not sure what the right mechanics ought to be for Mesos. It used
to be that you would use the "bsexec" subcommand to run something in a
different session, but that is deprecated and I don't see an obvious
replacement in the new subcommands. Maybe worth asking on the
launchd-dev mailing list ...
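
A quick way to confirm the session mismatch is to capture the session
info from both contexts and compare them; a sketch:

    # from a normal Terminal login:
    launchctl print user/$(id -u) > /tmp/login-session.txt
    # from inside the Mesos task (e.g. prepended to the task's command):
    launchctl print user/$(id -u) > /tmp/task-session.txt
    diff /tmp/login-session.txt /tmp/task-session.txt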


On 11 May 2016 at 12:10, DiGiorgio, Mr. Rinaldo S. <rdigior...@pace.edu> wrote:
>
> On May 5, 2016, at 13:28, haosdent <haosd...@gmail.com> wrote:
>
>>There is no explicit statement about what Mesos means when it runs a task
>> as some other user.
> I think this just ensures the running user of the task is the user you
> gave. In Mesos, it just calls [setuid](http://linux.die.net/man/2/setuid)
> to change the user; it would not execute something like the user's
> bashrc script.
>
>
> I have been unable to solve this problem for the last few days. I am
> wondering if you have any ideas.
>
>
>
> When Mesos starts a task on an OSX machine, the task is run with setuid to
> the user I have asked for.  When that task runs I cannot get that user to
> have a default login keychain.  I want to initialize the environment so that
> the user has something that looks like this.
>
>  existinguser$ security login-keychain
>
>
>  "/Users/rinaldo/Library/Keychains/login.keychain”
>
>
> I have tried many options to create the above keychain for the other user
> that is running in a process that was created by mesos and changed to that
> user with setuid.
>
> I understand that is likely not a Mesos issue. I am hoping someone on this
> alias has come across this issue or something similar.  I have tried the
> following and they have all failed.
>
> su -c   as existinguser
>
> /bin/login as existinguser
>
> OSX is not Open Source so it is difficult to understand what it is they do
> to create a user environment.  The “security” application has many options
> to create keychains, but when I use those options the keychains end up in
>
> "/Library/Keychains/System.keychain"
>
>
>   I have not yet investigated how a user is able to create a keychain in the
> System.keychain when running as a user in a Mesos-created process.
>
>
> Rinaldo
>
>
>
>
>
> On Thu, May 5, 2016 at 7:41 PM, DiGiorgio, Mr. Rinaldo S.
> <rdigior...@pace.edu> wrote:
>>
>> Hi,
>>
>> Recently I noticed that the Mesos Jenkins plugin supports the
>> setting of environment variables. Somewhere between 0.26 and 0.28.1,
>> settings like
>>
>> USER=
>> HOME=
>>
>> were required to get things to work the way they had worked. I
>> have been able to set the environment this way but I have some concerns
>> about it.
>>
>> There is no explicit statement about what Mesos means when it runs
>> a task as some other user.  Clearly it is not running some of the scripts
>> normally run during login.  This was a constant source of confusion with
>> Jenkins. If one can state what exactly is done to create the user
>> environment each platform and how it is different that others it will save
>> countless hours of debugging IMO. I realize OSX is an odd system -- linux at
>> times, Apple specific at times in areas that conflict with Linux but this
>> will only get more complicated when Windows agents become available.
>>
>>
>>
>> Rinaldo
>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>



-- 
James Peach | jor...@gmail.com


Re: distributed file systems

2016-05-11 Thread james

On 05/11/2016 10:09 AM, Aaron Carey wrote:

What exactly do you mean by deploying a mesos cluster to run on ceph etc?

Do you mean having a clustered file system mounted via nfs to the hosts which 
contains the
mesos binaries?


That would be one way to use a DFS, but low latency on a variety of
different linux kernels will be evaluated too. So will the testing of
uClibc-ng, musl and other fundamental components, as to the trade-offs
of performance vs robustness of supported frameworks.



Or something to do with how jobs are executed?


YES, exactly. In fact, it is most interesting to test with High
Performance Computing (HPC) on top of Mesos. The simple way to think
about one of the ultimate goals of the research is how to completely
replace Hadoop with Mesos, a DFS and many other components. Portability,
not limited to arm64v8, is a companion track related to HPC on clusters
and bare metal hardware. But most HPC installations moving forward will
have to also support some typical (admin/web/security/AI/etc) types of
workloads, as part of the evaluation.



Mesos does seem to be flexible enough for these sorts of explorations
and experiments. There is no inclination toward constricting ideas and
testing of competing components; only the desire to experiment, share
and refine the component mix (aka the cluster-stack), all in open
forums, i.e. Cluster-Stack_A vs Cluster-Stack_B etc. for various
workloads, on fixed, modest cluster sizes.



Think of it as experimentation that will eventually lead to published 
results, such as what mips/mops and various benchmarks have done for 
hardware, only the cluster-stack is the variable(s) under study.


All ideas, comments and criticisms are most welcome. HDFS is mostly 
described as the main bottleneck to HPC efforts and much of what folks 
are doing with various DFS replacements is being too strictly 
constrained, imho.



James



Aaron Carey
Production Engineer - Cloud Pipeline
Industrial Light & Magic
London
020 3751 9150


From: james [gar...@verizon.net]
Sent: 11 May 2016 17:08
To: user@mesos.apache.org
Subject: Re: distributed file systems

Hello Rodrick,

That EFS looks interesting, but I did not find the location for the
source-code/git download?  I do not remember the (linux) kernel hooks
for that Distributed File System, or is it completely on top of the
systems codes?

License details and I'm not sure if it's 100% opensource?

Beegfs [A] is partially opensource, but that does not fit what is needed
for experimentation. A robust community around open sources and tools,
such as github, should have been mentioned. Equally important
is a community keen on sharing and supporting other efforts to replicate
and use the components of these cluster centric codes. [B,C]


James

[A] http://www.beegfs.com/content/

[B] https://forums.aws.amazon.com/thread.jspa?threadID=217783

[C]
http://searchaws.techtarget.com/news/4500272286/Amazon-EFS-stuck-in-beta-lacks-native-Windows-support




On 05/11/2016 01:07 AM, Rodrick Brown wrote:

Does EFS count? :-)

https://aws.amazon.com/efs/


--

*Rodrick Brown* / Systems Engineer

+1 917 445 6839 / rodr...@orchardplatform.com

*Orchard Platform*

101 5th Avenue, 4th Floor, New York, NY 10003

http://www.orchardplatform.com

Orchard Blog <http://www.orchardplatform.com/blog/> | Marketplace
Lending Meetup <http://www.meetup.com/Peer-to-Peer-Lending-P2P/>


On May 10 2016, at 9:07 pm, james <gar...@verizon.net> wrote:

 Hello,


 Has anyone customized/compiled mesos and successfully deployed a mesos
 cluster to run on cephfs, orangefs [1], or any other distributed file
 systems?

 If so, some detail on your setup would be appreciated.


 [1]
http://www.phoronix.com/scan.php?page=news_item&px=OrangeFS-Lands-Linux-4.6



Re: distributed file systems

2016-05-11 Thread james

Hello Rodrick,

That EFS looks interesting, but I did not find the location for the
source-code/git download. I do not remember the (linux) kernel hooks
for that Distributed File System, or is it completely on top of the
systems codes?

License details and I'm not sure if it's 100% opensource?

Beegfs [A] is partially opensource, but that does not fit what is needed 
for experimentation. A robust community around open sources and tools, 
such as github, should have been mentioned. Equally important

is a community keen on sharing and supporting other efforts to replicate
and use the components of these cluster centric codes. [B,C]


James

[A] http://www.beegfs.com/content/

[B] https://forums.aws.amazon.com/thread.jspa?threadID=217783

[C] 
http://searchaws.techtarget.com/news/4500272286/Amazon-EFS-stuck-in-beta-lacks-native-Windows-support





On 05/11/2016 01:07 AM, Rodrick Brown wrote:

Does EFS count? :-)

https://aws.amazon.com/efs/


--

*Rodrick Brown* / Systems Engineer

+1 917 445 6839 / rodr...@orchardplatform.com

*Orchard Platform*

101 5th Avenue, 4th Floor, New York, NY 10003

http://www.orchardplatform.com

Orchard Blog <http://www.orchardplatform.com/blog/> | Marketplace
Lending Meetup <http://www.meetup.com/Peer-to-Peer-Lending-P2P/>


On May 10 2016, at 9:07 pm, james <gar...@verizon.net> wrote:

Hello,


Has anyone customized/compiled mesos and successfully deployed a mesos
cluster to run on cephfs, orangefs [1], or any other distributed file
systems?

If so, some detail on your setup would be appreciated.


[1]
http://www.phoronix.com/scan.php?page=news_item&px=OrangeFS-Lands-Linux-4.6






Simple question about slave<->master latency?

2016-05-11 Thread James Vanns
Is the latency (perhaps the weighted rolling average) between master and a
slave measured? If so, is it recorded as an attribute of a slave object in
the scheduler API?

Cheers,

Jim

--
Senior Production Engineer
Industrial Light & Magic


distributed file systems

2016-05-10 Thread james

Hello,


Has anyone customized/compiled mesos and successfully deployed  a mesos 
cluster to run on cephfs, orangefs [1], or any other distributed file 
systems?


If so, some detail on your setup would be appreciated.


[1] 
http://www.phoronix.com/scan.php?page=news_item&px=OrangeFS-Lands-Linux-4.6






Re: Enable s3a for fetcher

2016-05-10 Thread Briant, James
Ok. Thanks Joseph. I will figure out how to get a more recent hadoop onto my 
dcos agents then.

Jamie

From: Joseph Wu <jos...@mesosphere.io>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, May 10, 2016 at 1:40 PM
To: user <user@mesos.apache.org>
Subject: Re: Enable s3a for fetcher

I can't speak to what DCOS does or will do (you can ask on the associated 
mailing list: us...@dcos.io).

We will be maintaining existing functionality for the fetcher, which means 
supporting the schemes:
* file
* http, https, ftp, ftps
* hdfs, hftp, s3, s3n  <--  These rely on hadoop.

And we will retain the --hadoop_home agent flag, which you can use to specify 
the hadoop binary.
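
For example (a sketch only, assuming the stock `hadoop` wrapper script,
which honours HADOOP_CLASSPATH; the jar names come from the dcos layout
quoted at the bottom of this thread, and the install prefix is a
placeholder):

    AWS_JARS=/opt/hadoop/usr/share/hadoop/tools/lib   # placeholder prefix
    export HADOOP_CLASSPATH="$AWS_JARS/aws-java-sdk-1.7.4.jar:$AWS_JARS/hadoop-aws-2.5.0-cdh5.3.3.jar:$HADOOP_CLASSPATH"
    hadoop fs -copyToLocal s3a://my-bucket/my-artifact /tmp/my-artifact

On a 2.5.x client you may also need fs.s3a.impl set to
org.apache.hadoop.fs.s3a.S3AFileSystem in core-site.xml.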

Other schemes might work right now, if you hack around with your node setup.  
But there's no guarantee that your hack will work between Mesos versions.  In 
future, we will associate a fetcher plugin for each scheme.  And you will be 
able to load custom fetcher plugins for additional schemes.
TLDR: no "nerfing" and less hackiness :)

On Tue, May 10, 2016 at 12:58 PM, Briant, James 
<james.bri...@thermofisher.com> wrote:
This is the mesos latest documentation:

If the requested URI is based on some other protocol, then the fetcher tries to 
utilise a local Hadoop client and hence supports any protocol supported by the 
Hadoop client, e.g., HDFS, S3. See the slave configuration 
documentation <http://mesos.apache.org/documentation/latest/configuration/> for 
how to configure the slave with a path to the Hadoop client. [emphasis added]

What you are saying is that dcos simply won't install hadoop on agents?

Next question then: will you be nerfing fetcher.cpp, or will I be able to 
install hadoop on the agents myself, such that mesos will recognize s3a?


From: Joseph Wu <jos...@mesosphere.io>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, May 10, 2016 at 12:20 PM
To: user <user@mesos.apache.org>

Subject: Re: Enable s3a for fetcher

Mesos does not explicitly support HDFS and S3.  Rather, Mesos will assume you 
have a hadoop binary and use it (blindly) for certain types of URIs.  If the 
hadoop binary is not present, the mesos-fetcher will fail to fetch your HDFS or 
S3 URIs.

Mesos does not ship/package hadoop, so these URIs are not expected to work out 
of the box (for plain Mesos distributions).  In all cases, the operator must 
preconfigure hadoop on each node (similar to how Docker in Mesos works).

Here's the epic tracking the modularization of the mesos-fetcher (I estimate 
it'll be done by 0.30):
https://issues.apache.org/jira/browse/MESOS-3918

^ Once done, it should be easier to plug in more fetchers, such as one for your 
use-case.

On Tue, May 10, 2016 at 11:21 AM, Briant, James 
<james.bri...@thermofisher.com> wrote:
I’m happy to have default IAM role on the box that can read-only fetch from my 
s3 bucket. s3a gets the credentials from AWS instance metadata. It works.

If hadoop is gone, does that mean that hdfs: URIs don’t work either?

Are you saying dcos and mesos are diverging? Mesos explicitly supports hdfs and 
s3.

In the absence of S3, how do you propose I make large binaries available to my 
cluster, and only to my cluster, on AWS?

Jamie

From: Cody Maloney <c...@mesosphere.io>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, May 10, 2016 at 10:58 AM
To: "user@mesos.apache.org" <user@mesos.apache.org>
Subject: Re: Enable s3a for fetcher

The s3 fetcher stuff inside of DC/OS is not supported. The `hadoop` binary has 
been entirely removed from DC/OS 1.8 already. There have been various proposals 
to make it so the mesos fetcher is much more pluggable / extensible 
(https://issues.apache.org/jira/browse/MESOS-2731 for instance).

Generally speaking people want a lot of different sorts of fetching, and there 
are all sorts of questions of how to properly get auth to the various chunks 
(if you're using s3a:// presumably you need to get credentials there somehow. 
Otherwise you could just use http://). Need to design / build that into Mesos 
and DC/OS to be able to use this stuff.

Cody

On Tue, May 10, 2016 at 9:55 AM Briant, James 
<james.bri...@thermofisher.com> wrote:
I want to use s3a: urls in fetcher. I’

Re: Enable s3a for fetcher

2016-05-10 Thread Briant, James
This is the mesos latest documentation:

If the requested URI is based on some other protocol, then the fetcher tries to 
utilise a local Hadoop client and hence supports any protocol supported by the 
Hadoop client, e.g., HDFS, S3. See the slave configuration 
documentation <http://mesos.apache.org/documentation/latest/configuration/> for 
how to configure the slave with a path to the Hadoop client. [emphasis added]

What you are saying is that dcos simply won't install hadoop on agents?

Next question then: will you be nerfing fetcher.cpp, or will I be able to 
install hadoop on the agents myself, such that mesos will recognize s3a?


From: Joseph Wu <jos...@mesosphere.io>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, May 10, 2016 at 12:20 PM
To: user <user@mesos.apache.org>
Subject: Re: Enable s3a for fetcher

Mesos does not explicitly support HDFS and S3.  Rather, Mesos will assume you 
have a hadoop binary and use it (blindly) for certain types of URIs.  If the 
hadoop binary is not present, the mesos-fetcher will fail to fetch your HDFS or 
S3 URIs.

Mesos does not ship/package hadoop, so these URIs are not expected to work out 
of the box (for plain Mesos distributions).  In all cases, the operator must 
preconfigure hadoop on each node (similar to how Docker in Mesos works).

Here's the epic tracking the modularization of the mesos-fetcher (I estimate 
it'll be done by 0.30):
https://issues.apache.org/jira/browse/MESOS-3918

^ Once done, it should be easier to plug in more fetchers, such as one for your 
use-case.

On Tue, May 10, 2016 at 11:21 AM, Briant, James 
<james.bri...@thermofisher.com> wrote:
I’m happy to have default IAM role on the box that can read-only fetch from my 
s3 bucket. s3a gets the credentials from AWS instance metadata. It works.

If hadoop is gone, does that mean that hdfs: URIs don’t work either?

Are you saying dcos and mesos are diverging? Mesos explicitly supports hdfs and 
s3.

In the absence of S3, how do you propose I make large binaries available to my 
cluster, and only to my cluster, on AWS?

Jamie

From: Cody Maloney <c...@mesosphere.io>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, May 10, 2016 at 10:58 AM
To: "user@mesos.apache.org" <user@mesos.apache.org>
Subject: Re: Enable s3a for fetcher

The s3 fetcher stuff inside of DC/OS is not supported. The `hadoop` binary has 
been entirely removed from DC/OS 1.8 already. There have been various proposals 
to make it so the mesos fetcher is much more pluggable / extensible 
(https://issues.apache.org/jira/browse/MESOS-2731 for instance).

Generally speaking people want a lot of different sorts of fetching, and there 
are all sorts of questions of how to properly get auth to the various chunks 
(if you're using s3a:// presumably you need to get credentials there somehow. 
Otherwise you could just use http://). Need to design / build that into Mesos 
and DC/OS to be able to use this stuff.

Cody

On Tue, May 10, 2016 at 9:55 AM Briant, James 
<james.bri...@thermofisher.com> wrote:
I want to use s3a: urls in fetcher. I’m using dcos 1.7 which has hadoop 2.5 on 
its agents. This version has the necessary hadoop-aws and aws-sdk:

hadoop--afadb46fe64d0ee7ce23dbe769e44bfb0767a8b9]$ ls 
usr/share/hadoop/tools/lib/ | grep aws
aws-java-sdk-1.7.4.jar
hadoop-aws-2.5.0-cdh5.3.3.jar

What config/scripts do I need to hack to get these guys on the classpath so 
that "hadoop fs -copyToLocal” works?

Thanks,
Jamie



Re: Enable s3a for fetcher

2016-05-10 Thread Briant, James
I’m happy to have default IAM role on the box that can read-only fetch from my 
s3 bucket. s3a gets the credentials from AWS instance metadata. It works.

If hadoop is gone, does that mean that hdfs: URIs don’t work either?

Are you saying dcos and mesos are diverging? Mesos explicitly supports hdfs and 
s3.

In the absence of S3, how do you propose I make large binaries available to my 
cluster, and only to my cluster, on AWS?

Jamie

From: Cody Maloney <c...@mesosphere.io>
Reply-To: "user@mesos.apache.org" <user@mesos.apache.org>
Date: Tuesday, May 10, 2016 at 10:58 AM
To: "user@mesos.apache.org" <user@mesos.apache.org>
Subject: Re: Enable s3a for fetcher

The s3 fetcher stuff inside of DC/OS is not supported. The `hadoop` binary has 
been entirely removed from DC/OS 1.8 already. There have been various proposals 
to make it so the mesos fetcher is much more pluggable / extensible 
(https://issues.apache.org/jira/browse/MESOS-2731 for instance).

Generally speaking people want a lot of different sorts of fetching, and there 
are all sorts of questions of how to properly get auth to the various chunks 
(if you're using s3a:// presumably you need to get credentials there somehow. 
Otherwise you could just use http://). Need to design / build that into Mesos 
and DC/OS to be able to use this stuff.

Cody

On Tue, May 10, 2016 at 9:55 AM Briant, James 
<james.bri...@thermofisher.com> wrote:
I want to use s3a: urls in fetcher. I’m using dcos 1.7 which has hadoop 2.5 on 
its agents. This version has the necessary hadoop-aws and aws-sdk:

hadoop--afadb46fe64d0ee7ce23dbe769e44bfb0767a8b9]$ ls 
usr/share/hadoop/tools/lib/ | grep aws
aws-java-sdk-1.7.4.jar
hadoop-aws-2.5.0-cdh5.3.3.jar

What config/scripts do I need to hack to get these guys on the classpath so 
that "hadoop fs -copyToLocal” works?

Thanks,
Jamie


Enable s3a for fetcher

2016-05-10 Thread Briant, James
I want to use s3a: urls in fetcher. I’m using dcos 1.7 which has hadoop 2.5 on 
its agents. This version has the necessary hadoop-aws and aws-sdk:

hadoop--afadb46fe64d0ee7ce23dbe769e44bfb0767a8b9]$ ls 
usr/share/hadoop/tools/lib/ | grep aws
aws-java-sdk-1.7.4.jar
hadoop-aws-2.5.0-cdh5.3.3.jar

What config/scripts do I need to hack to get these guys on the classpath so 
that "hadoop fs -copyToLocal” works?

Thanks,
Jamie 

Re: [Proposal] Remove the default value for agent work_dir

2016-04-12 Thread James Peach

> On Apr 12, 2016, at 3:58 PM, Greg Mann  wrote:
> 
> Hey folks!
> A number of situations have arisen in which the default value of the Mesos 
> agent `--work_dir` flag (/tmp/mesos) has caused problems on systems in which 
> the automatic cleanup of '/tmp' deletes agent metadata. To resolve this, we 
> would like to eliminate the default value of the agent `--work_dir` flag. You 
> can find the relevant JIRA here.
> 
> We considered simply changing the default value to a more appropriate 
> location, but decided against this because the expected filesystem structure 
> varies from platform to platform, and because it isn't guaranteed that the 
> Mesos agent would have access to the default path on a particular platform.
> 
> Eliminating the default `--work_dir` value means that the agent would exit 
> immediately if the flag is not provided, whereas currently it launches 
> successfully in this case. This will break existing infrastructure which 
> relies on launching the Mesos agent without specifying the work directory. I 
> believe this is an acceptable change because '/tmp/mesos' is not a suitable 
> location for the agent work directory except for short-term local testing, 
> and any production scenario that is currently using this location should be 
> altered immediately.

+1 from me too. Defaulting to /tmp just helps people shoot themselves in the 
foot.
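
Concretely it's a one-flag change; the path below is only an example:

    mesos-slave --master=zk://zkhost:2181/mesos --work_dir=/var/lib/mesos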

J

Re: Does mesos support running two slaves on a single host?

2016-03-22 Thread James Mulcahy

Hi Shiyao,

Yes, mesos supports this, and when testing I often run with two local
slaves. You need to ensure that you run each slave with distinct
--work_dir and --port arguments.
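
Something like this (a sketch; the master address and paths are
placeholders):

    mesos-slave --master=127.0.0.1:5050 --port=5051 --work_dir=/tmp/mesos-1 &
    mesos-slave --master=127.0.0.1:5050 --port=5052 --work_dir=/tmp/mesos-2 &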

It likely doesn’t make sense for a production deployment however.  What is it 
you really want to do here?

—James


> On Mar 22, 2016, at 07:17, Shiyao Ma <i...@introo.me> wrote:
> 
> Hi,
> 
> I vaguely remember last time I asked a question on this lists. Some comments 
> mentioned that, a host can only run a single slave.
> Is that true?
> 
> Regards.
> 
> -- 
> 
> I am a cat. My homepage is https://introo.me.



Re: verbose logging with the docker executor

2016-03-19 Thread James Peach

> On Mar 17, 2016, at 10:09 AM, Clarke, Trevor  wrote:
> 
> Looking in the docker executor, the docker command line is logged with 
> VLOG(1) but I'm not sure how to generate that level of log output. Some 
> googling suggests it's used in the google logging library and verbose logging 
> would be enabled with something like --v=1 but that's not a valid mesos-slave 
> option. Can someone point me in the right direction? (currently using 0.24.1)

You can set the GLOG_v environment variable (see 
https://google-glog.googlecode.com/svn/trunk/doc/glog.html#verbose) to the 
desired verbosity level and then restart mesos-slave. If you just want to 
increase the log level without a restart, you can hit the /logging/toggle 
endpoint on the mesos-slave (do curl http://127.0.0.1:5051/help/logging/toggle 
for the online help).
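
Concretely, something like this (the level and duration values are just
examples):

    # persistent: restart the slave with verbose logging
    GLOG_v=1 mesos-slave --master=...

    # temporary: bump a running slave to v=1 for five minutes
    curl 'http://127.0.0.1:5051/logging/toggle?level=1&duration=5mins'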

J

Re: Run Kubernetes on top of Mesos by a Docker-Compose.yml

2016-01-12 Thread James DeFelice
Look at cluster/mesos/docker in the official k8s github repo.
On Jan 12, 2016 7:57 AM, "Den Cowboy"  wrote:

> I'm searching for a tutorial or 'example' of a docker-compose.yml which is
> using mesos, and kubernetes (on top of it).
> Is there something which I can use?
>
> Thanks
>


Re: Option to not include RPATHs in executables.

2015-12-14 Thread James DeFelice
You might get a faster response by emailing mesos dev list.

On Mon, Dec 14, 2015 at 11:13 AM, Hans van den Bogert <hansbog...@gmail.com>
wrote:

> Hi,
>
> My environment:
> - GCC 4.9 is installed in a non-standard way (can’t change this) and uses
> the LD_LIBRARY_PATH to run/compile correctly
>
> Mesos executables have the  /usr/lib64 path in their RPATHs.
> The problem is that the default libstdc++ (in /usr/lib64) is now used
> instead of the libstdc++ with which Mesos was compiled. Multiple errors
> happen, like:
> /somepath/sbin/mesos-slave: /usr/lib64/libstdc++.so.6: version
> `GLIBCXX_3.4.14' not found (required by /somepath/sbin/mesos-slave)
>
> Does Mesos have an option during configure to disable RPATHs attached to
> executables? In my environment  RPATHs are not helping and I’ve already
> seen that removing  the RPATHs (using chrpath) after installing remedies my
> situation.
>   Or,
> is there a way to change the RPATH settings such that the LD_LIBRARY_PATH
> paths are incorporated inside the RPATHs of executables, in front of the
> default /usr/lib64/ — frankly, why is libtool adding this standard location
> anyway?
>
> Regards,
>
> Hans
>
>
>
>
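
For reference, the post-install workaround Hans describes amounts to:

    # strip the baked-in RPATH from the installed binaries
    # (a workaround rather than a fix)
    chrpath -d /somepath/sbin/mesos-slave
    chrpath -d /somepath/sbin/mesos-master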


-- 
James DeFelice
585.241.9488 (voice)
650.649.6071 (fax)


Re: Task still 'active' after TASK_FINISHED status

2015-11-25 Thread James Vanns
I don't know what the Chronos default is - but in the recent case I posted
about, we use whatever the Chronos default is. I just checked their
documentation, and it states they use the Mesos command executor.

As far as our own framework, which exhibits similar behaviour, we don't
explicitly specify one (but we do use ContainerInfo::DockerInfo). We do set
a command for the task to run so I guess that assumes a CommandExecutor?

Cheers,

Jim


On 25 November 2015 at 15:51, David Greenberg <dsg123456...@gmail.com>
wrote:

> If you're using a custom executor, this could happen if you don't actually
> exit the executor process. Is this using CommandExecutor or a custom one?
>
> On Wed, Nov 25, 2015 at 5:01 AM James Vanns <jvanns@gmail.com> wrote:
>
>> Er, I could. At the moment it's pretty huge so maybe I'll just try and
>> trim it down a bit. I've noticed that Chronos does the same, actually.
>> There is a task that is 'active' and still holding onto resources yet it
>> has already completed unsuccessfully with TASK_FAILED (16hrs ago!) state.
>> Attached is a log of the events from the mesos slave that executed this
>> particular Chronos task (before it continues to forward the same state over
>> and over). Note that the last pair of lines is repeated ad-infinitum. I can
>> confirm that this Chronos framework with the same ID is still running.
>>
>> Sorry to switch frameworks suddenly - this was simpler because it was one
>> task instead of 100s.
>>
>> Jim
>>
>> On 24 November 2015 at 17:57, Vinod Kone <vinodk...@gmail.com> wrote:
>>
>>> Can you paste the logs?
>>>
>>> On Tue, Nov 24, 2015 at 2:16 AM, James Vanns <jvanns@gmail.com>
>>> wrote:
>>>
>>>> Hi again list.
>>>>
>>>> Mesos 0.24
>>>> C++ Framework (still using the Protobufs based comms, not REST)
>>>>
>>>> My framework appears to be holding onto offers (somehow) from tasks
>>>> that are finished!? I don't understand why. Each task consists of a shell
>>>> command that executes within a docker container.
>>>> The return code to the OS from the shell command is indeed zero for
>>>> success, which Mesos honours and transitions to TASK_FINISHED state.
>>>> However, using the UI these still register as 'active' (though acknowledged
>>>> as FINISHED) and thus the resources are not yet freed.
>>>>
>>>> Any pointers appreciated!
>>>>
>>>> Cheers,
>>>>
>>>> Jim
>>>>
>>>> --
>>>> Senior Code Pig
>>>> Industrial Light & Magic
>>>>
>>>
>>>
>>
>>
>> --
>> --
>> Senior Code Pig
>> Industrial Light & Magic
>>
>


-- 
--
Senior Code Pig
Industrial Light & Magic


Re: Task still 'active' after TASK_FINISHED status

2015-11-25 Thread James Vanns
Er, I could. At the moment it's pretty huge so maybe I'll just try and trim
it down a bit. I've noticed that Chronos does the same, actually. There is
a task that is 'active' and still holding onto resources yet it has already
completed unsuccessfully with TASK_FAILED (16hrs ago!) state. Attached is a
log of the events from the mesos slave that executed this particular
Chronos task (before it continues to forward the same state over and over).
Note that the last pair of lines is repeated ad-infinitum. I can confirm
that this Chronos framework with the same ID is still running.

Sorry to switch frameworks suddenly - this was simpler because it was one
task instead of 100s.

Jim

On 24 November 2015 at 17:57, Vinod Kone <vinodk...@gmail.com> wrote:

> Can you paste the logs?
>
> On Tue, Nov 24, 2015 at 2:16 AM, James Vanns <jvanns@gmail.com> wrote:
>
>> Hi again list.
>>
>> Mesos 0.24
>> C++ Framework (still using the Protobufs based comms, not REST)
>>
>> My framework appears to be holding onto offers (somehow) from tasks that
>> are finished!? I don't understand why. Each task consists of a shell
>> command that executes within a docker container.
>> The return code to the OS from the shell command is indeed zero for
>> success, which Mesos honours and transitions to TASK_FINISHED state.
>> However, using the UI these still register as 'active' (though acknowledged
>> as FINISHED) and thus the resources are not yet freed.
>>
>> Any pointers appreciated!
>>
>> Cheers,
>>
>> Jim
>>
>> --
>> Senior Code Pig
>> Industrial Light & Magic
>>
>
>


-- 
--
Senior Code Pig
Industrial Light & Magic
I1124 16:57:10.782155   104 slave.cpp:1739] Sending queued task 'ct:1448384217181:2:olio:' to executor 'ct:1448384217181:2:olio:' of framework 20151119-165710-4000912556-5050-1-0052
I1124 16:57:11.490393   107 slave.cpp:2696] Handling status update TASK_RUNNING (UUID: a3dd3fc7-c5cf-4d7a-bbf9-efe8e799b54a) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052 from executor(1)@172.20.121.112:43306
I1124 16:57:11.490633   105 status_update_manager.cpp:322] Received status update TASK_RUNNING (UUID: a3dd3fc7-c5cf-4d7a-bbf9-efe8e799b54a) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052
I1124 16:57:11.491034   105 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_RUNNING (UUID: a3dd3fc7-c5cf-4d7a-bbf9-efe8e799b54a) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052
I1124 16:57:11.49   107 slave.cpp:2696] Handling status update TASK_FAILED (UUID: ab530800-346a-4ba5-9455-7d35c90adf19) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052 from executor(1)@172.20.121.112:43306
I1124 16:57:11.493294   104 slave.cpp:2975] Forwarding the update TASK_RUNNING (UUID: a3dd3fc7-c5cf-4d7a-bbf9-efe8e799b54a) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052 to master@172.20.121.238:5050
I1124 16:57:11.493482   104 slave.cpp:2905] Sending acknowledgement for status update TASK_RUNNING (UUID: a3dd3fc7-c5cf-4d7a-bbf9-efe8e799b54a) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052 to executor(1)@172.20.121.112:43306
I1124 16:57:11.499173   105 status_update_manager.cpp:394] Received status update acknowledgement (UUID: a3dd3fc7-c5cf-4d7a-bbf9-efe8e799b54a) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052
I1124 16:57:11.499286   105 status_update_manager.cpp:826] Checkpointing ACK for status update TASK_RUNNING (UUID: a3dd3fc7-c5cf-4d7a-bbf9-efe8e799b54a) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052
I1124 16:57:11.561130   107 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: ab530800-346a-4ba5-9455-7d35c90adf19) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052
I1124 16:57:11.561177   107 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_FAILED (UUID: ab530800-346a-4ba5-9455-7d35c90adf19) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052
I1124 16:57:11.562505   109 slave.cpp:2975] Forwarding the update TASK_FAILED (UUID: ab530800-346a-4ba5-9455-7d35c90adf19) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052 to master@172.20.121.238:5050
I1124 16:57:11.562604   109 slave.cpp:2905] Sending acknowledgement for status update TASK_FAILED (UUID: ab530800-346a-4ba5-9455-7d35c90adf19) for task ct:1448384217181:2:olio: of framework 20151119-165710-4000912556-5050-1-0052 to executor(1)@172.20.121.112:43306
I1124 16:57:12.566434   107 docker.cpp:1584] Executor for container 'db9f7671-13cb-430f-b74f-3fc1df898f89' has 

Task still 'active' after TASK_FINISHED status

2015-11-24 Thread James Vanns
Hi again list.

Mesos 0.24
C++ Framework (still using the Protobufs based comms, not REST)

My framework appears to be holding onto offers (somehow) from tasks that
are finished!? I don't understand why. Each task consists of a shell
command that executes within a docker container.
The return code to the OS from the shell command is indeed zero for
success, which Mesos honours and transitions to TASK_FINISHED state.
However, using the UI these still register as 'active' (though acknowledged
as FINISHED) and thus the resources are not yet freed.

Any pointers appreciated!

Cheers,

Jim

--
Senior Code Pig
Industrial Light & Magic


statusUpdate() duplicate messages?

2015-11-18 Thread James Vanns
Hello list.

We have an experimental framework (C++ API) based on Mesos 0.24 and we're
seeing duplicate task status messages -- eg. 2 'FINISHED' messages for a
single task. This may well be normal behaviour but I wasn't prepared for
it. Could someone point me in the direction of a decent description on
status updates/messages somewhere in the Mesos documentation? Or explain
the following;

a) Is this normal (it's not just the FINISHED state)?
b) What might cause this behaviour (it's intermittent)?
c) I do not explicitly acknowledge receipt of these messages - should I!?
d) Should I treat these status update messages as reliable and robust!?
e) Where can I learn more about this kind of internal detail?

Cheers,

Jim

--
Senior Code Pig
Industrial Light & Magic


Re: statusUpdate() duplicate messages?

2015-11-18 Thread James Vanns
Thanks very much for the prompt response, Tom. I shall go and read up on
reconciliation (I'd expected there to be something like this to read). And
to my knowledge, no I don't explicitly disable the implicit status
acknowledgement ;)
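
For anyone else curious, the explicit path looks roughly like the
sketch below against the 0.24 C++ API (persist() here is a made-up
stand-in for whatever durable store a framework keeps):

    #include <vector>

    #include <mesos/scheduler.hpp>

    using namespace mesos;

    // Hypothetical durable store for updates; stand-in only.
    void persist(const TaskStatus& status);

    class MyScheduler : public Scheduler {
    public:
      // ... other required Scheduler callbacks elided for brevity ...

      virtual void statusUpdate(SchedulerDriver* driver,
                                const TaskStatus& status) {
        // Updates are delivered at-least-once, so handle them idempotently.
        persist(status);

        // With implicit acks disabled, unacknowledged updates are re-sent.
        driver->acknowledgeStatusUpdate(status);
      }

      virtual void registered(SchedulerDriver* driver,
                              const FrameworkID&, const MasterInfo&) {
        // Implicit reconciliation: an empty list asks the master to send
        // the latest state for every task it knows about.
        driver->reconcileTasks(std::vector<TaskStatus>());
      }
    };

    // The driver is then constructed with implicit acks switched off:
    //   MesosSchedulerDriver driver(&scheduler, framework, master, false);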

Cheers,

Jim


On 18 November 2015 at 12:24, Tom Arnfeld <t...@duedil.com> wrote:

> When you construct the scheduler, are you disabling implicit
> acknowledgements?
>
> https://github.com/apache/mesos/blob/master/include/mesos/scheduler.hpp#L373
>
> I’d suggest having a read over this document, it explains some of this ->
> http://mesos.apache.org/documentation/latest/reconciliation/
>
> a) Mesos may re-send messages if you don’t acknowledge them, and task
> status messages are guaranteed *at least once*
> c) If you disable implicit status acknowledgement, yep
> d) You should, they are guaranteed to be delivered *at some point* *at
> least once* by the slave / master. To keep your framework in sync with
> the cluster it is recommended to reconcile tasks often (as explained in the
> document above)
> e) http://mesos.apache.org/documentation/latest/reconciliation/
>
> Hope that helps, and I think that’s all correct! The docs will be able to
> clarify better :-)
>
> On 18 Nov 2015, at 12:09, James Vanns <jvanns@gmail.com> wrote:
>
> Hello list.
>
> We have an experimental framework (C++ API) based on Mesos 0.24 and we're
> seeing duplicate task status messages -- eg. 2 'FINISHED' messages for a
> single task. This may well be normal behaviour but I wasn't prepared for
> it. Could someone point me in the direction of a decent description on
> status updates/messages somewhere in the Mesos documentation? Or explain
> the following;
>
> a) Is this normal (it's not just the FINISHED state)?
> b) What might cause this behaviour (it's intermittent)?
> c) I do not explicitly acknowledge receipt of these messages - should I!?
> d) Should I treat these status update messages as reliable and robust!?
> e) Where can I learn more about this kind of internal detail?
>
> Cheers,
>
> Jim
>
> --
> Senior Code Pig
> Industrial Light & Magic
>
>
>


-- 
--
Senior Code Pig
Industrial Light & Magic


Re: Marathon 0.11.1 - Mesos 0.25 - Mesos-DNS 0.4.0

2015-11-03 Thread James DeFelice
The default value of IPSources doesn't have `docker` listed. As long as
that's not in the list you shouldn't have had a problem, unless some bad
actor was writing the wrong labels into the task. I don't see support for
NetworkInfos (`netinfos`) in marathon yet, which means that `host` should
have been the fallback.

Did you, by chance, have `docker` listed in IPSources at any point?
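
For reference, the documented default ordering is, if I remember the
config page right:

    "IPSources": ["netinfo", "mesos", "host"]

Sources are tried in order and the first one that yields an IP wins, so
with that default a netinfo-supplied container IP would win over the
host IP once a framework actually fills in NetworkInfos.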


On Tue, Nov 3, 2015 at 12:04 PM, John Omernik <j...@omernik.com> wrote:

> I used
>
> "IPSources": ["host", "netinfo", "mesos"]
>
>
> With the thought that I would preference for the host at this point. When
> network isolation works in Marathon, then I will likely switch to netinfo.
>
> On Mon, Nov 2, 2015 at 7:28 PM, James DeFelice <james.defel...@gmail.com>
> wrote:
>
>> What settings worked for you? We did aim for least surprise. Sounds like
>> we missed a bit. We're happy to accept suggestions for improvement via gh
>> issues filed against the mesos-dns repo.
>> On Oct 29, 2015 7:39 AM, "John Omernik" <j...@omernik.com> wrote:
>>
>>> That is good to know, however, I would challenge the group on something
>>> like this not being bug based on the documentation.  When a change in
>>> mesos-dns, and what fields it looks at is not affected by the mesos-dns
>>> component, but instead other components in a way that could have serious
>>> negative impacts on folks who are running this, there should be some
>>> fanfare there about changes.  Also, I would advocate that in mesos-dns the
>>> default should have been the same as previous releases (which I would
>>> assume was host ip) as default, then allow people who are aware of the
>>> underpinnings to make the change.
>>>
>>> On Wed, Oct 28, 2015 at 3:02 PM, Grzegorz Graczyk <gregor...@gmail.com>
>>> wrote:
>>>
>>>> It's not a bug, it's a feature -
>>>> http://mesosphere.github.io/mesos-dns/docs/configuration-parameters.html 
>>>> look
>>>> at IPSources config
>>>>
>>>> śr., 28.10.2015 o 15:59 użytkownik John Omernik <j...@omernik.com>
>>>> napisał:
>>>>
>>>>> If I rolled back mesos-dns to v0.2.0 (on the releases page) then it
>>>>> pulls the right IP address..   (Mesos-dns version is the easiest of the
>>>>> three to change)
>>>>>
>>>>> John
>>>>>
>>>>> On Wed, Oct 28, 2015 at 9:52 AM, John Omernik <j...@omernik.com>
>>>>> wrote:
>>>>>
>>>>>> So, the issues that are listed appear to be resolved with marathon
>>>>>> 0.11.1, and the mesos-dns issue is not listed at all.
>>>>>>
>>>>>> Note, I tried mesos-dns 0.3.0 and that has the same problem as 0.4.0.
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 9:46 AM, John Omernik <j...@omernik.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I will check out those issues and report back.
>>>>>>>
>>>>>>> On Wed, Oct 28, 2015 at 9:42 AM, craig w <codecr...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I've had no issue with the following combination:
>>>>>>>>
>>>>>>>> MesosDNS 0.4.0
>>>>>>>> Marathon 0.11.0
>>>>>>>> Mesos 0.24.1
>>>>>>>>
>>>>>>>> I've been waiting to upgrade to Mesos 0.25.0 because of issues
>>>>>>>> mentioned in the mesos mailing list regarding Marathon 0.11.x and Mesos
>>>>>>>> 0.25.0
>>>>>>>>
>>>>>>>> On Wed, Oct 28, 2015 at 10:38 AM, John Omernik <j...@omernik.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey all -
>>>>>>>>>
>>>>>>>>> I am cross posting this because it's a number of moving parts that
>>>>>>>>> could be at issue here (Mesos, Mesos-dns, and/or Marathon).
>>>>>>>>>
>>>>>>>>> Basically: At the version combination in Subject, the IP that is
>>>>>>>>> registered in mesos-dns for Docker containers running in Marathon is 
>>>>>>>>> the
>>>>>>>>> internal (container) IP address of the docker (in bridged mode) not 
>>>>>>

Re: Marathon 0.11.1 - Mesos 0.25 - Mesos-DNS 0.4.0

2015-11-02 Thread James DeFelice
What settings worked for you? We did aim for least surprise. Sounds like we
missed a bit. We're happy to accept suggestions for improvement via gh
issues filed against the mesos-dns repo.
On Oct 29, 2015 7:39 AM, "John Omernik"  wrote:

> That is good to know; however, I would challenge the group on something
> like this not being a bug, based on the documentation.  When a change in
> which fields mesos-dns looks at is driven not by the mesos-dns component
> itself but by other components, in a way that could have serious
> negative impacts on folks who are running this, there should be some
> fanfare about the change.  Also, I would advocate that the mesos-dns
> default should have been the same as in previous releases (which I would
> assume was the host IP), then allow people who are aware of the
> underpinnings to make the change.
>
> On Wed, Oct 28, 2015 at 3:02 PM, Grzegorz Graczyk 
> wrote:
>
>> It's not a bug, it's a feature -
>> http://mesosphere.github.io/mesos-dns/docs/configuration-parameters.html look
>> at IPSources config
>>
>> Wed, 28.10.2015 at 15:59, John Omernik 
>> wrote:
>>
>>> If I rolled back mesos-dns to v0.2.0 (on the releases page) then it
>>> pulls the right IP address..   (Mesos-dns version is the easiest of the
>>> three to change)
>>>
>>> John
>>>
>>> On Wed, Oct 28, 2015 at 9:52 AM, John Omernik  wrote:
>>>
 So, the issues that are listed appear to be resolved with marathon
 0.11.1, and the mesos-dns issue is not listed at all.

 Note, I tried mesos-dns 0.3.0 and that has the same problem as 0.4.0.

 On Wed, Oct 28, 2015 at 9:46 AM, John Omernik  wrote:

> I will check out those issues and report back.
>
> On Wed, Oct 28, 2015 at 9:42 AM, craig w  wrote:
>
>> I've had no issue with the following combination:
>>
>> MesosDNS 0.4.0
>> Marathon 0.11.0
>> Mesos 0.24.1
>>
>> I've been waiting to upgrade to Mesos 0.25.0 because of issues
>> mentioned in the mesos mailing list regarding Marathon 0.11.x and Mesos
>> 0.25.0
>>
>> On Wed, Oct 28, 2015 at 10:38 AM, John Omernik 
>> wrote:
>>
>>> Hey all -
>>>
>>> I am cross posting this because it's a number of moving parts that
>>> could be at issue here (Mesos, Mesos-dns, and/or Marathon).
>>>
>>> Basically: At the version combination in Subject, the IP that is
>>> registered in mesos-dns for Docker containers running in Marathon is the
>>> internal (container) IP address of the docker (in bridged mode) not the
>>> node's. This obviously causes issues.  Note this doesn't happen when the
>>> Marathon application is non-Docker.
>>>
>>> I was running Mesos-dns 0.4.0 on a cluster running Mesos 0.24.0 and
>>> Marathon 0.10.0 and I upgraded to Mesos 0.25.0 and Marathon 0.11.1 and
>>> noticed this behavior happening.
>>>
>>> I thought that was odd because I have another cluster that was
>>> running Mesos 0.25.0 and Marathon 0.11.1 and it wasn't happening, until
>>> I realized that I hadn't upgraded Mesos-dns lately. I upgraded to
>>> Mesos-dns 0.4.0 and the problem started occurring.
>>>
>>> Is there a setting that I need to use the external IP of the
>>> container? Is this issue known? Is there a workaround? This is pretty 
>>> major
>>> for Docker running on Marathon and using Mesos-dns for service 
>>> discovery.
>>>
>>> John Omernik
>>>
>>>
>>>
>>
>>
>> --
>>
>> https://github.com/mindscratch
>> https://www.google.com/+CraigWickesser
>> https://twitter.com/mind_scratch
>> https://twitter.com/craig_links
>>
>>
>
>

>>
>
>


Re: Tasks that run docker images consistently fail while downloading

2015-10-28 Thread James Vanns
Ahem. Yes. I had meant 300,000 ;)

Cheers,

Jim
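
For the record, the combination this thread converges on looks like this; a
sketch, not a complete command line (the trailing flags are elided):

  # slave side: give executors (and their image pulls) up to five minutes
  mesos-slave --executor_registration_timeout=5mins ...

  # Marathon side: at least as long, but expressed in milliseconds
  marathon --task_launch_timeout=300000 ...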


On 28 October 2015 at 12:10, Rad Gruchalski <ra...@gruchalski.com> wrote:

> However, it should be 300000, not 3000. It’s milliseconds - 5 mins in
> milliseconds is 300000.
>
> Optional. Default: 300000 (5 minutes)
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
>
> On Wednesday, 28 October 2015 at 13:08, Rad Gruchalski wrote:
>
> Nobody getting those today ;) Good catch. Worth keeping in mind!
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
>
> On Wednesday, 28 October 2015 at 13:06, James Vanns wrote:
>
> I shall fix my own problem; it's embarrassing. Top marks to those of
> you who noticed I supplied 3000 instead of 300000 (which I understand is
> actually the default anyway) to task_launch_timeout!
>
> Jim
>
>
> On 28 October 2015 at 10:21, James Vanns <jvanns@gmail.com> wrote:
>
> Hi all.
>
> Mesos version = 0.23.0-1.0.ubuntu1404 (mesosphere APT repo)
> Marathon version = 0.10.1 (mesosphere APT repo)
>
> Hopefully this is a simple one for someone to answer, though I couldn't
> find anything immediately
> obvious in the documentation. We're trialling Mesos in a cloud (EC2/GCE)
> environment and the one
> thing that continues to bite us in the ass is this; continued task
> failures until the docker image is
> fully downloaded! Why is this!? Some of our images are small (say 200MB),
> some much larger (2GB)
> due to the nature of the software packages we're containerising.
> Regardless of this size, they fail the
> first dozen (or more) times until one of the slaves has pulled the image.
> Why is there an apparent
> hard time-out and how can I avoid it? I don't want the task to register as
> a fail - it hasn't even had a
> chance to run yet! Up until now we've just been tolerating the bouncing
> around of these tasks but it's
> now reached a point where it's darn annoying ;)
>
> I've tried setting executor_registration_timeout to '5mins' but this made
> no apparent difference (every
> minute the task is killed still). I should note that these tasks are
> launched using the Marathon
> framework and I've tried setting 'task_launch_timeout' to '3000' and
> again, it makes no difference.
>
> Based on a brief glance of a mesos slave log file it seems the master
> instructs the slave to kill the task off after 1 minute.
>
> Please advise.
>
> Cheers,
>
> Jim
>
> --
> Senior Code Pig
> Industrial Light & Magic
>
>
>
>
> --
> --
> Senior Code Pig
> Industrial Light & Magic
>
>
>
>


-- 
--
Senior Code Pig
Industrial Light & Magic


Re: Tasks that run docker images consistently fail while downloading

2015-10-28 Thread James Vanns
I shall fix my own problem; it's embarrassing. Top marks to those of you
who noticed I supplied 3000 instead of 300000 (which I understand is
actually the default anyway) to task_launch_timeout!

Jim


On 28 October 2015 at 10:21, James Vanns <jvanns@gmail.com> wrote:

> Hi all.
>
> Mesos version = 0.23.0-1.0.ubuntu1404 (mesosphere APT repo)
> Marathon version = 0.10.1 (mesosphere APT repo)
>
> Hopefully this is a simple one for someone to answer, though I couldn't
> find anything immediately
> obvious in the documentation. We're trialling Mesos in a cloud (EC2/GCE)
> environment and the one
> thing that continues to bite us in the ass is this; continued task
> failures until the docker image is
> fully downloaded! Why is this!? Some of our images are small (say 200MB),
> some much larger (2GB)
> due to the nature of the software packages we're containerising.
> Regardless of this size, they fail the
> first dozen (or more) times until one of the slaves has pulled the image.
> Why is there an apparent
> hard time-out and how can I avoid it? I don't want the task to register as
> a fail - it hasn't even had a
> chance to run yet! Up until now we've just been tolerating the bouncing
> around of these tasks but it's
> now reached a point where it's darn annoying ;)
>
> I've tried setting executor_registration_timeout to '5mins' but this made
> no apparent difference (every
> minute the task is killed still). I should note that these tasks are
> launched using the Marathon
> framework and I've tried setting 'task_launch_timeout' to '3000' and
> again, it makes no difference.
>
> Based on a brief glance of a mesos slave log file it seems the master
> instructs the slave to kill the task off after 1 minute.
>
> Please advise.
>
> Cheers,
>
> Jim
>
> --
> Senior Code Pig
> Industrial Light & Magic
>



-- 
--
Senior Code Pig
Industrial Light & Magic


Re: Tasks that run docker images consistently fail while downloading

2015-10-28 Thread James Vanns
Yes I have - I mention that in my email ;) I set it to the same as the
'executor_registration_timeout'. Both are effectively set to 5 minutes - but my
tasks are killed off after 1 minute without being allowed to fully download
the image.

Jim

On 28 October 2015 at 10:26, Rad Gruchalski <ra...@gruchalski.com> wrote:

> Jim,
>
> Have you tried —task_launch_timeout? From:
> https://mesosphere.github.io/marathon/docs/native-docker.html
>
> Configure Marathon
>
>1. Increase the Marathon command line option
><https://mesosphere.github.io/marathon/docs/command-line-flags.html>
>--task_launch_timeout to at least the executor timeout, in
>milliseconds, you set on your slaves in the previous step.
>
>
> Kind regards,
> Radek Gruchalski
> ra...@gruchalski.com <ra...@gruchalski.com>
> de.linkedin.com/in/radgruchalski/
>
>
>
> On Wednesday, 28 October 2015 at 11:21, James Vanns wrote:
>
> Hi all.
>
> Mesos version = 0.23.0-1.0.ubuntu1404 (mesosphere APT repo)
> Marathon version = 0.10.1 (mesosphere APT repo)
>
> Hopefully this is a simple one for someone to answer, though I couldn't
> find anything immediately
> obvious in the documentation. We're trialling Mesos in a cloud (EC2/GCE)
> environment and the one
> thing that continues to bite us in the ass is this; continued task
> failures until the docker image is
> fully downloaded! Why is this!? Some of our images are small (say 200MB),
> some much larger (2GB)
> due to the nature of the software packages we're containerising.
> Regardless of this size, they fail the
> first dozen (or more) times until one of the slaves has pulled the image.
> Why is there an apparent
> hard time-out and how can I avoid it? I don't want the task to register as
> a fail - it hasn't even had a
> chance to run yet! Up until now we've just been tolerating the bouncing
> around of these tasks but it's
> now reached a point where it's darn annoying ;)
>
> I've tried setting executor_registration_timeout to '5mins' but this made
> no apparent difference (every
> minute the task is killed still). I should note that these tasks are
> launched using the Marathon
> framework and I've tried setting 'task_launch_timeout' to '3000' and
> again, it makes no difference.
>
> Based on a brief glance of a mesos slave log file it seems the master
> instructs the slave to kill the task off after 1 minute.
>
> Please advise.
>
> Cheers,
>
> Jim
>
> --
> Senior Code Pig
> Industrial Light & Magic
>
>
>


-- 
--
Senior Code Pig
Industrial Light & Magic


Re: Batch/queue frameworks?

2015-10-07 Thread James DeFelice
The OP might also be interested in Stolos:
https://github.com/sailthru/stolos

combined with Relay: https://github.com/sailthru/relay


On Wed, Oct 7, 2015 at 8:15 AM, Clarke, Trevor <tcla...@ball.com> wrote:

> I'm currently working on this sort of framework. Unfortunately, source is
> not currently available but there is a plan to open source in the next
> couple of months. I'm not sure if your need is immediate or if it can wait
> for a bit. The framework handles jobs in docker containers with pre and
> post steps (copy data into the node, products out, etc.). Individual jobs
> can be strung together in a DAG for complex processing. Directories can be
> watched for new data and jobs can be started in response to this data.
>
> 
> From: Brian Candler [b.cand...@pobox.com]
> Sent: Wednesday, October 07, 2015 3:56 AM
> To: user@mesos.apache.org
> Subject: Batch/queue frameworks?
>
> Are there any open-source job queue/batch systems which run under Mesos?
> I am thinking of things like HTCondor, Torque etc.
>
> The requirement is to be able to:
> - define an overall job as a set of sub-tasks (could be many thousands)
> - put sub-tasks into a queue; execute tasks from the queue
> - dependencies: don't add a sub-task into the queue until its precursors
> have completed successfully
> - restart: after an error, be able to restart the job but skipping those
> sub-tasks which completed successfully
> - preferably handle short-lived tasks efficiently (of order of 10
> seconds duration)
>
> Clearly it's possible to write a framework to do this, but I don't want
> to re-invent the wheel if it has been done already.
>
> Thanks,
>
> Brian.
>
> P.S. I found Chronos, but it doesn't seem a good match. As far as I can
> see, it's intended for applications where you pre-define a bunch of
> tasks (via GUI? via REST?) and then trigger them periodically.
>
>



-- 
James DeFelice
585.241.9488 (voice)
650.649.6071 (fax)


Re: OS X build

2015-09-27 Thread James Peach

> On Sep 27, 2015, at 4:15 PM, Vaibhav Khanduja <vaibhavkhand...@gmail.com> 
> wrote:
> 
> Probably yes, 
> 
> The issue I am pointing out is that the configure script does not accept the 
> option "—with-arp"

OK then I'm confused because there is a --with-apr option, and it works AFAICT

jpeach$ ./configure --help | grep apr
  --with-apr=[=DIR]   specify where to locate the apr-1 library

> 
> On Sat, Sep 26, 2015 at 9:26 PM, James Peach <jor...@gmail.com> wrote:
> 
> > On Sep 26, 2015, at 12:01 PM, Vaibhav Khanduja <vaibhavkhand...@gmail.com> 
> > wrote:
> >
> > I am running into issues with the build on my Mac - OS X … the configure script 
> > complains about libapr-1 not being present. I was able to find a workaround by 
> > passing configure the —with-apr option. Looks like the script checks for the 
> > variable to be a valid shell variable; if not, it is rejected.
> >
> > I was able to work around it by using quotes and dropping a "-" from the variable:
> >
> >  ../configure -disable-python --disable-java 
> > "-with-apr=/usr/local/Cellar/apr/1.5.2/libexec/"
> >
> > the configure —help though suggests to use —with-apr
> 
> AFAICT you are supposed to use apr-1-config to fish out the libapr path when 
> using Homebrew. I think it would be reasonable for the Mesos build to 
> just automatically use apr-1-config if it is present.
> 
> ./configure --with-apr=$(apr-1-config --prefix)
> 
> >
> >
> > am I missing something here?
> >
> > Thx
> 
> 



Re: OS X build

2015-09-26 Thread James Peach

> On Sep 26, 2015, at 12:01 PM, Vaibhav Khanduja  
> wrote:
> 
> I am running into issues with the build on my Mac - OS X … the configure script 
> complains about libapr-1 not being present. I was able to find a workaround by 
> passing configure the —with-apr option. Looks like the script checks for the 
> variable to be a valid shell variable; if not, it is rejected.
> 
> I was able to work around it by using quotes and dropping a "-" from the variable:
> 
>  ../configure -disable-python --disable-java 
> "-with-apr=/usr/local/Cellar/apr/1.5.2/libexec/"
> 
> the configure —help though suggests to use —with-apr

AFAICT you are supposed to use apr-1-config to fish out the libapr path when 
using Homebrew. I think it would be reasonable for the Mesos build to just 
automatically use apr-1-config if it is present.

./configure --with-apr=$(apr-1-config --prefix)
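
With Homebrew's keg-only apr, apr-1-config is typically not on the PATH at
all; a sketch of locating it first (the keg layout is an assumption and
varies between formula versions):

  APR_CONFIG=$(find "$(brew --prefix apr)" -name apr-1-config | head -n 1)
  ./configure --with-apr=$("$APR_CONFIG" --prefix)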

>  
> 
> am I missing something here?
> 
> Thx



Mesos master metrics endpoint?

2015-09-23 Thread James Vanns
Hi all. It appears there is a glaring omission in the 'Tasks' section of
the following doc:

http://mesos.apache.org/documentation/latest/monitoring/

Shouldn't there be a 'Tasks waiting' metric!? We generally have tasks
hanging around for a while because their resource requests can't (yet) be
met by any offer - so we'd like a counter for how many!? Did I miss
something?

Also, what about near-real-time metrics of a running task? E.g. resource
consumption (n% of CPUs asked for, 8GB/12GB allocated etc.). Can we get
that information?

Cheers,

Jim

--
Senior Code Pig
Industrial Light & Magic
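
For what it's worth, the counters that do exist can be read from the master's
/metrics/snapshot endpoint, and per-executor resource usage from each slave's
/monitor/statistics.json endpoint; a sketch with assumed hostnames:

  curl -s http://master.example.com:5050/metrics/snapshot
  curl -s http://slave.example.com:5051/monitor/statistics.json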


Re: Building portable binaries

2015-09-17 Thread James Peach

> On Sep 17, 2015, at 4:33 PM, F21  wrote:
> 
> Is there any way to build portable binaries for mesos?
> 
> Currently, I have tried building my own libsvn, libsasl2, libcurl, libapr and 
> then built mesos using the following:
> 
> ../configure CC=gcc-4.8 CXX=g++-4.8 
> LD_LIBRARY_PATH=/tmp/mesos-build/sasl2/lib 
> SASL_PATH=/tmp/mesos-build/sasl2/lib/sasl2 --prefix=/tmp/mesos-build/mesos 
> --with-svn=/tmp/mesos-build/svn --with-apr=/tmp/mesos-build/apr 
> --with-sasl=/tmp/mesos-build/sasl2/ --with-curl=/tmp/mesos-build/curl
> make
> make install
> 
> I then compress /tmp/mesos-build/mesos into an archive and distribute it to 
> my machines. The only problem is that the build seems to be buggy. For 
> example, I've been experiencing containerization issues where the executors 
> will crash, but not output anything useful to stderr and stdout. See 
> https://github.com/mesosphere/hdfs/issues/194
> 
> Is there a definite way to build portable binaries that I can easily copy to 
> another machine to run?

You could do a statically linked build by doing configure --enable-static 
--disable-shared. I don't know whether that is supported in the Mesos build, 
but it is a standard automake feature, so if it fails it should be fixable.

J
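
A quick way to see how portable a given build actually is before copying it
around (install prefix taken from the configure invocation above; on OS X
use otool -L instead of ldd):

  ldd /tmp/mesos-build/mesos/sbin/mesos-slave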



Re: Recommended way to discover current master

2015-08-31 Thread James Peach

> On Aug 31, 2015, at 10:25 AM, Philip Weaver  wrote:
> 
> My framework knows the list of zookeeper hosts and the list of mesos master 
> hosts.
> 
> I can think of a few ways for the framework to figure out which host is the 
> current master. What would be the best? Should I check in zookeeper directly? 
> Does the mesos library expose an interface to discover the master from 
> zookeeper or otherwise? Should I just try each possible master until one 
> responds?

If you want to do it the HTTP way, just hit the /master/redirect endpoint on 
any master that you can reach.
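
For example (hostname assumed - the Location header of the 307 response
names the current leader):

  curl -s -D - -o /dev/null \
    http://any-master.example.com:5050/master/redirect | grep -i '^Location'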

> 
> Apologies if this is already well documented, but I wasn't able to find it. 
> Thanks!
> 
> - Philip
> 


