On augmenting TLS configuration options in libprocess

2019-05-24 Thread Alex Rukletsov
Folks,

We reviewed TLS configuration options in libprocess and came up with the
following proposal [1] to allow for certificate verification in client mode
only.

In short, the proposal suggests adding two flags to libprocess so that it
can be configured to:
* always require presence and verify server certificates,
* never request client certificates,
* validate hostname using OpenSSL calls.

Please review.

[1]
https://docs.google.com/document/d/1O3q7UOXVGNw81xOkRNFPzrtbC__D-N_D_mwV6D--y0k/edit


Re: '*.json' endpoints removed in 1.7

2019-05-11 Thread Alex Rukletsov
Before we decide, I'd like to offer another angle. Thanks to the removal of
the endpoint aliases, the widely used but only occasionally maintained
mesos-dns has been updated and a newer version will be released soon [1],
thanks to jdef. I understand the frustration when software stops working
after an update for silly reasons like removing endpoint aliases, but at the
same time it can be an incentive to update other components in the ecosystem
as well, switching not just from one endpoint to another, but bringing other
changes together into the release.

[1] https://github.com/mesosphere/mesos-dns/releases/tag/v0.7.0-rc2

On Fri, May 10, 2019 at 5:03 PM Vinod Kone  wrote:

> I propose that we revert this change and keep the ".json" endpoints in
> master branch and 1.8.x
>
> My reasoning is that, we have ecosystem components (e.g., mesos-dns which
> is yet to have a release with fix) and anecdotally a bunch of custom
> tooling at user sites that depend on these ".json" endpoints (esp.
> /state.json). The amount of techdebt that we saved or consistency we
> achieved in the codebase by doing this is not worth the tradeoff of
> breaking some user/tooling, in my opinion. We could revisit this if and
> when we do a Mesos 2.0.
>
> On Wed, Aug 8, 2018 at 9:25 AM Alex Rukletsov  wrote:
>
> > Folks,
> >
> > The long ago deprecated '*.json' endpoints will be removed in Mesos 1.7.0.
> > Please use their non-'.json' counterparts instead.
> >
> > Commit:
> > https://github.com/apache/mesos/commit/42551cb5290b7b04101f7d800b4b8fd573e47b91
> > JIRA ticket: https://issues.apache.org/jira/browse/MESOS-4509
> >
> > Alex.
> >
>


Re: [VOTE] Release Apache Mesos 1.4.3 (rc1)

2019-01-28 Thread Alex Rukletsov
This will be the last official 1.4.x release. Even though we agreed to keep
the branch and occasionally back port fixes to it after the last release,
maybe it makes sense to include all pending patches in 1.4.3? I see for
example Gilbert added the fix for MESOS-9532 [1]. We were also considering
back porting other test fixes [2] to the 1.4.x branch.

[1] https://github.com/apache/mesos/commits/1.4.x
[2] https://gist.github.com/rukletsov/a2a7bedad58010ab8adf209cdc5eef0c

On Fri, Jan 25, 2019 at 11:12 PM Meng Zhu  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.4.3.
>
> 1.4.3 includes the following:
>
> 
> https://issues.apache.org/jira/issues/?filter=12345433
>
> The CHANGELOG for the release is available at:
>
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.3-rc1
>
> 
>
> The candidate for Mesos 1.4.3 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz
>
> The tag to be voted on is 1.4.3-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.3-rc1
>
> The SHA512 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz.sha512
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/1.4.3-rc1/mesos-1.4.3.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1244
>
> Please vote on releasing this package as Apache Mesos 1.4.3!
>
> The vote is open until Mon Jan 30th 14:02:55 PST 2019 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.4.3
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Meng
>


Re: Join us at MesosCon 2018 next week!

2018-11-07 Thread Alex Rukletsov
I'd like to thank everyone involved in organising this MesosCon, and
especially Gastón, Jörg, and Andy. I enjoyed the laid-back "underground"
style this year; it was easy to engage in conversations with users and
Mesos developers. Looking forward to the next MesosCon!

Alex

On Thu, Nov 1, 2018 at 10:07 PM Vaibhav Khanduja 
wrote:

> Thank You,
>
> I am looking at the schedule of events. There is a hackathon on Wednesday;
> are there more details available? When to register etc?
>
> On Thu, Nov 1, 2018 at 11:37 AM Gastón Kleiman 
> wrote:
>
> > You can pick up your ticket at 30% off here (source tweet).
> >
> > On Thu, Nov 1, 2018 at 10:33 AM Vaibhav Khanduja <
> > vaibhavkhand...@gmail.com> wrote:
> >
> >> Thanks for the email.
> >>
> >> Is there any promotional code available for enterprises?
> >>
> >> On Wed, Oct 31, 2018 at 5:06 PM Gastón Kleiman 
> >> wrote:
> >>
> >>> MesosCon 2018 is taking place next week! Join us and celebrate the 5th
> >>> anniversary of MesosCon November 5th-7th, in The Village (969 Market
> >>> St, San Francisco).
> >>>
> >>> MesosCon North America is an annual conference organized by the Apache
> >>> Mesos community, bringing together users and developers to share and
> >>> learn about the Apache Mesos project, containers, DevOps, and automation.
> >>>
> >>> What to expect
> >>>
> >>> MesosCon will include tracks focused on case studies and architecture of
> >>> modern, containerized applications, fast data tools like Spark,
> >>> Cassandra, and TensorFlow, and about Mesos itself. Attendees can expect
> >>> engaging keynotes, technical breakout sessions, and collaborative town
> >>> hall sessions to include Mesos and the broader ecosystem. Attendees can
> >>> expect to:
> >>>
> >>> - Learn how to design and build their own custom frameworks
> >>> - Discover how easy it is to build, deploy, and scale your applications
> >>> - Dive deep into Mesos internals, storage, security, and networking
> >>> - Network with the community and share best practices and lessons
> >>>   learned
> >>>
> >>> Check out the schedule and register at http://mesoscon2018.org.
> >>>
> >>> Cheers,
> >>>
> >>> The MesosCon 2018 organization team
> >>>
> >
>


On committer candidate nomination

2018-10-16 Thread Alex Rukletsov
Folks,

A seemingly complex and long path to becoming a committer can drive away
potential candidates shortly after they start contributing to the project.
Around a year ago Jim Jagielski raised a concern about the high entry bar
we have in the project. We heard the feedback and decided to liberalize our
process for nominating new committers by simplifying the committer
checklist.

1) We have relaxed our committer candidate guidelines, see the updated
version [1].
2) Committer checklist [2] is a thing of the past: candidates are no longer
supposed to fill it in.
3) Nominators are encouraged to use template [3] when proposing new
candidates.

Alex on behalf of Mesos PMC.

[1]
https://github.com/apache/mesos/blob/07ab5abb1db91fda3fa118083dc15065f314a3fd/docs/committer-candidate-guidelines.md
[2]
https://github.com/apache/mesos/blob/69f3744f3b2f8e2a8116f023020696950af573ad/docs/committer-candidate-checklist.md
[3]
https://docs.google.com/document/d/1RBShT_kSqWqvG7HOzQhpNINGd17ZkJXGY7vMyxTZZXg/edit


Re: [VOTE] Release Apache Mesos 1.7.0 (rc3)

2018-09-14 Thread Alex Rukletsov
+1 (binding)

Mesosphere's internal CI ran against the aforementioned tag. 4 flaky tests
were observed, 3 of which are known:
https://issues.apache.org/jira/browse/MESOS-5048
https://issues.apache.org/jira/browse/MESOS-8260
https://issues.apache.org/jira/browse/MESOS-8951

One has been introduced as part of adding GC to nested containers
(MESOS-7947), which is disabled in the release:
https://issues.apache.org/jira/browse/MESOS-9217


On Tue, Sep 11, 2018 at 8:09 PM, Gastón Kleiman 
wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
>
>
> 1.7.0 includes the following:
> 
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through
>     RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF
>     sorter
>
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc3
> 
> 
>
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz
>
> The tag to be voted on is 1.7.0-rc3:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc3
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc3/mesos-1.7.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1234
>
> Please vote on releasing this package as Apache Mesos 1.7.0!
>
> The vote is open until Fri Sep 14 11:06:30 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
>
> Chun-Hung & Gastón
>


Re: [VOTE] Release Apache Mesos 1.7.0 (rc1)

2018-08-22 Thread Alex Rukletsov
MESOS-9177 has been filed today. It is very likely a regression introduced
by one of the state.json improvements. We are still investigating, but it
is obviously a

-1 (binding)

for rc1.

Alex.


On Wed, Aug 22, 2018 at 4:34 AM, Chun-Hung Hsiao  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
>
>
> 1.7.0 includes the following:
> 
> 
> * Performance Improvements:
>   * Master `/state` endpoint: ~130% throughput improvement through
>     RapidJSON
>   * Allocator: Improved allocator cycle significantly
>   * Agent `/containers` endpoint: Fixed a performance issue
>   * Agent container launch / destroy throughput is significantly improved
> * Containerization:
>   * **Experimental** Supported docker image tarball fetching from HDFS
>   * Added new `cgroups/all` and `linux/devices` isolators
>   * Added metrics for `network/cni` isolator and docker pull latency
> * Windows:
>   * Added support to libprocess for the Windows Thread Pool API
> * Multi-Framework Workloads:
>   * **Experimental** Added per-framework metrics to the master
>   * A new weighted random sorter was added as an alternative to the DRF
>     sorter
> * Bug fixes: 84 bugs fixed, including 20 critical ones.
>
> The CHANGELOG for the release is available at:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc1
> 
> 
>
> The candidate for Mesos 1.7.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz
>
> The tag to be voted on is 1.7.0-rc1:
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc1
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc1/mesos-1.7.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/service/local/repositories/orgapachemesos-1232/
>
> Please vote on releasing this package as Apache Mesos 1.7.0!
>
> The vote is open until Fri Aug 24 19:16:39 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.7.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Chun-Hung & Gaston
>


Re: [VOTE] Release Apache Mesos 1.4.2 (rc1)

2018-08-20 Thread Alex Rukletsov
+1 binding (make check on Mac OS 10.13.5)

On Mon, Aug 20, 2018 at 8:28 PM, Kapil Arya  wrote:

> +1 binding (internal CI).
>
> The Apache CI failures reported by Vinod are all known flaky tests. I have
> inserted the details inline.
>
> Best,
> Kapil
>
> On Tue, Aug 14, 2018 at 11:03 AM Vinod Kone  wrote:
>
>> I see some flaky tests in ASF CI, that I don't see already reported.
>>
>> @Kapil Arya Can you take a look at
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/53 and see
>> if the flaky tests are due to bugs in test code and not source?
>>
>> *Revision*: 612ec2c63a68b4d5b60d1d864e6703fde1c2a023
>>
>>- refs/tags/1.4.2-rc1
>>
>> Configuration Matrix (gcc / clang)
>> centos:7, --verbose --enable-libevent --enable-ssl, autotools, gcc: Failed
>>
>
> Failed due to the timeout being too short -- leveldb took 11 seconds to
> respond while the timeout expired at 10 seconds. It also looks like the
> previous operation took longer than expected, potentially due to some
> machine load at the time.
>
> E0814 00:51:13.001557  8738 registrar.cpp:575] Registrar aborting: Failed to update registry: Failed to perform store within 10secs
> ../../src/tests/registrar_tests.cpp:331: Failure
> (registrar.apply( Owned( new MarkSlaveUnreachable(info1, protobuf::getCurrentTime().failure(): Failed to update registry: Failed to perform store within 10secs
> I0814 00:51:18.990106  8743 leveldb.cpp:341] Persisting action (218 bytes) to leveldb took 11.656345772secs
>
>
>> ubuntu:14.04, --verbose --enable-libevent --enable-ssl, autotools, gcc: Failed
>> ubuntu:14.04, --verbose --enable-libevent --enable-ssl, autotools, clang: Failed
>>
>
>> ubuntu:14.04, --verbose, autotools, clang: Failed
>>
>
> Failures because of known double-free corruption in test code due to
> parallel manipulation of signal and control handler:
> https://issues.apache.org/jira/browse/MESOS-8084
>
>
>
>> ubuntu:14.04, --verbose --enable-libevent --enable-ssl, cmake, gcc: Failed
>>
>
> Failure due to known flaky test:
> https://issues.apache.org/jira/browse/MESOS-7028
>
> On Mon, Aug 13, 2018 at 7:41 PM Benjamin Mahler 
> wrote:
>
>>
>> > +1 (binding)
>> >
>> > make check passes on macOS 10.13.6 with Apple LLVM version 9.1.0
>> > (clang-902.0.39.2).
>> >
>> > Thanks Kapil!
>> >
>> > On Wed, Aug 8, 2018 at 3:06 PM, Kapil Arya  wrote:
>> >
>> > > Hi all,
>> > >
>> > > Please vote on releasing the following candidate as Apache Mesos 1.4.2.
>> > >
>> > > 1.4.2 is a bug fix release. The CHANGELOG for the release is available at:
>> > > https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.2-rc1
>> > >
>> > > The candidate for Mesos 1.4.2 release is available at:
>> > > https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz
>> > >
>> > > The tag to be voted on is 1.4.2-rc1:
>> > > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.4.2-rc1
>> > >
>> > > The SHA512 checksum of the tarball can be found at:
>> > > https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz.sha512
>> > >
>> > > The signature of the tarball can be found at:
>> > > https://dist.apache.org/repos/dist/dev/mesos/1.4.2-rc1/mesos-1.4.2.tar.gz.asc
>> > >
>> > > The PGP key used to sign the release is here:
>> > > https://dist.apache.org/repos/dist/release/mesos/KEYS
>> > >
>> > > The JAR is in a staging repository here:
>> > > https://repository.apache.org/content/repositories/orgapachemesos-1231
>> > >
>> > > Please vote on releasing this package as Apache Mesos 1.4.2!
>> > >
>> > > The vote is open until Sat Aug 11 11:59:59 PDT 

'*.json' endpoints removed in 1.7

2018-08-08 Thread Alex Rukletsov
Folks,

The long ago deprecated '*.json' endpoints will be removed in Mesos 1.7.0.
Please use their non-'.json' counterparts instead.

Commit:
https://github.com/apache/mesos/commit/42551cb5290b7b04101f7d800b4b8fd573e47b91
JIRA ticket: https://issues.apache.org/jira/browse/MESOS-4509

Alex.
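
For tooling that still requests the removed '*.json' aliases, the migration
amounts to dropping the suffix from the endpoint path. Below is a minimal C++
sketch. It assumes a hypothetical helper, stripJsonSuffix (not part of Mesos),
and assumes that every removed alias differs from its counterpart only by the
'.json' suffix, as the announcement above states.

```cpp
#include <string>

// Hypothetical helper (not part of Mesos): maps a deprecated '*.json'
// endpoint path to its non-'.json' counterpart. Paths that do not end in
// ".json" are returned unchanged.
std::string stripJsonSuffix(const std::string& path)
{
  static const std::string suffix = ".json";

  // Split off a query string, if any, so that only the path part is
  // inspected for the suffix.
  const std::string::size_type query = path.find('?');
  const std::string base = path.substr(0, query);
  const std::string rest =
    query == std::string::npos ? "" : path.substr(query);

  if (base.size() > suffix.size() &&
      base.compare(base.size() - suffix.size(), suffix.size(), suffix) == 0) {
    return base.substr(0, base.size() - suffix.size()) + rest;
  }

  return path;
}
```

With this helper, "/state.json" maps to "/state" and
"/master/state.json?jsonp=cb" maps to "/master/state?jsonp=cb"; a path such as
"/state" is left as it is.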


Re: [VOTE] Release Apache Mesos 1.3.3 (rc1)

2018-07-20 Thread Alex Rukletsov
MPark—

what's the decision regarding the 1.3.3 release?

On Mon, Jul 9, 2018 at 8:52 PM, Michael Park  wrote:

> I'm considering simply abandoning the 1.3.3 release and bringing the 1.3.x
> branch to end of life.
> If anyone really wants a 1.3.3, I'm certainly willing to finish the
> release portion of this
> but I don't have time to dig into the CI issue that Vinod pointed out. If
> someone feels compelled
> to investigate the issue and wants 1.3.3 released, please speak up.
>
> I'll wait for some time (say, a week) to gauge the interest and take
> corresponding action.
>
> Thanks,
>
> MPark
>
> On Thu, May 31, 2018 at 11:55 AM Vinod Kone  wrote:
>
>> -1 (binding).
>>
>>
>> Ran it in ASF CI and found an issue worth investigating. The other 3 issues
>> look to be related to known flaky tests and/or a known core dump issue (that
>> has been fixed in later versions).
>>
>> *Revision*: c78e56e4ea217878dd604de638623be166a18db0
>>
>>- refs/tags/1.3.3-rc1
>>
>> Configuration Matrix                                               gcc      clang
>> centos:7      --verbose --enable-libevent --enable-ssl  autotools  Failed   Not run
>>                                                         cmake      Success  Not run
>>               --verbose                                 autotools  Success  Not run
>>                                                         cmake      Success  Not run
>> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools  Success  Failed
>>                                                         cmake      Success  Failed
>>               --verbose                                 autotools  Failed   Success
>>                                                         cmake      Success  Success
>>
>>
>> 1) Segfault in HTTP Test.
>> 

Re: Backport Policy

2018-07-16 Thread Alex Rukletsov
Back porting as little as possible is the ultimate goal for me. My reasons
are closely aligned with what Andrew wrote above.

If we agree on this strategy, the next question is how to enforce it. My
intuition is that committers will lean towards back porting their patches
in arguable cases, because humans tend to overestimate the importance of
their personal work. Delegating the decision in such cases to a release
manager will, in my opinion, help us enforce the strategy of a minimal
number of backports. As a bonus, the release manager will have a much better
understanding of what's going on with the release, keyword: "more
ownership".

On Sat, Jul 14, 2018 at 12:07 AM, Andrew Schwartzmeyer <
and...@schwartzmeyer.com> wrote:

> I believe I fall somewhere between Alex and Ben.
>
> As for deciding what to backport or not, I lean toward Alex's view of
> backporting as little as possible (and agree with his criteria). My
> reasoning is that all changes can have unforeseen consequences, which I
> believe is something to be actively avoided in already released versions.
> The reason for backporting patches to fix regressions is the same as the
> reason to avoid backporting as much as possible: keep behavior consistent
> (and safe) within a release. With that as the goal of a branch in
> maintenance mode, it makes sense to fix regressions, and make exceptions to
> fix CVEs and other critical/blocking issues.
>
> As for who should decide what to backport, I lean toward Ben's view of the
> burden being on the committer. I don't think we should add more work for
> release managers, and I think the committer/shepherd obviously has the most
> understanding of the context around changes proposed for backport.
>
> Here's an example of a recent bugfix which I backported:
> https://reviews.apache.org/r/67587/ (for MESOS-3790)
>
> While normally I believe this change falls under "avoid due to unforeseen
> consequences," I made an exception as the bug was old, circa 2015,
> (indicating it had been an issue for others), and was causing recurring
> failures in testing. The fix itself was very small, meaning it was easier
> to evaluate for possible side effects, so I felt a little safer in that
> regard. The effect of not having the fix was a fatal and undesired crash,
> which furthermore left troublesome side effects on the system (you couldn't
> bring the agent back up). And lastly, a dependent project (DC/OS) wanted it
> in their next bump, which necessitated backporting to the release they were
> pulling in.
>
> I think in general we should backport only as necessary, and leave it on
> the committers to decide if backporting a particular change is necessary.
>
>
> On 07/13/2018 12:54 am, Alex Rukletsov wrote:
>
>> This is exactly where our views differ, Ben : )
>>
>> Ideally, I would like a release manager to have more ownership and less
>> manual work. In my imagination, a release manager has more power and
>> control about dates, features, backports and everything that is related to
>> "their" branch. I would also like us to back port as little as possible,
>> to simplify testing and releasing patch versions.
>>
>> On Fri, Jul 13, 2018 at 1:17 AM, Benjamin Mahler wrote:
>>
>>> +user, it would probably be good to hear from users as well.
>>>
>>> Please see the original proposal as well as Alex's proposal and let us
>>> know your thoughts.
>>>
>>> To continue the discussion from where Alex left off:
>>>
>>> > Other bugs and significant improvements, e.g., performance, may be back
>>> > ported, the release manager should ideally be the one who decides on
>>> > this.
>>>
>>> I'm a little puzzled by this, why is the release manager involved? As we
>>> already document, backports occur when the bug is fixed, so this happens
>>> in the steady state of development, not at release time. The release
>>> manager only comes in at the time of the release itself, at which point
>>> all backports have already happened and the release manager handles the
>>> release process. Only blocker level issues can stop the release and while
>>> the release manager has a strong say, we should generally agree on what
>>> constitutes a release blocking issue.
>>>
>>> Just to clarify my workflow, I generally backport every bug fix I commit
>>> that applies cleanly, right after I commit it to master (with the
>>> exceptions I listed below).
>>>
>>> On Thu, Jul 12, 2018 at 8:39 AM, Alex Rukletsov wrote:
>>>
>>> > I would like to back port as litt

Re: Backport Policy

2018-07-13 Thread Alex Rukletsov
This is exactly where our views differ, Ben : )

Ideally, I would like a release manager to have more ownership and less
manual work. In my imagination, a release manager has more power and
control about dates, features, backports and everything that is related to
"their" branch. I would also like us to back port as little as possible, to
simplify testing and releasing patch versions.

On Fri, Jul 13, 2018 at 1:17 AM, Benjamin Mahler  wrote:

> +user, it would probably be good to hear from users as well.
>
> Please see the original proposal as well as Alex's proposal and let us know
> your thoughts.
>
> To continue the discussion from where Alex left off:
>
> > Other bugs and significant improvements, e.g., performance, may be back
> > ported, the release manager should ideally be the one who decides on this.
>
> I'm a little puzzled by this, why is the release manager involved? As we
> already document, backports occur when the bug is fixed, so this happens in
> the steady state of development, not at release time. The release manager
> only comes in at the time of the release itself, at which point all
> backports have already happened and the release manager handles the release
> process. Only blocker level issues can stop the release and while the
> release manager has a strong say, we should generally agree on what
> constitutes a release blocking issue.
>
> Just to clarify my workflow, I generally backport every bug fix I commit
> that applies cleanly, right after I commit it to master (with the
> exceptions I listed below).
>
> On Thu, Jul 12, 2018 at 8:39 AM, Alex Rukletsov 
> wrote:
>
> > I would like to back port as little as possible. I suggest the following
> > criteria:
> >
> > * By default, regressions are back ported to existing release branches. A
> > bug is considered a regression if the functionality is present in the
> > previous minor or patch version and is not affected by the bug there.
> >
> > * Critical and blocker issues, e.g., a CVE, can be back ported.
> >
> > * Other bugs and significant improvements, e.g., performance, may be back
> > ported, the release manager should ideally be the one who decides on this.
> >
> > On Thu, Jul 12, 2018 at 12:25 AM, Vinod Kone wrote:
> >
> > > Ben, thanks for the clarification. I'm in agreement with the points you
> > > made.
> > >
> > > Once we have consensus, would you mind updating the doc?
> > >
> > > On Wed, Jul 11, 2018 at 5:15 PM Benjamin Mahler wrote:
> > >
> > > > I realized recently that we aren't all on the same page with
> > > > backporting. We currently only document the following:
> > > >
> > > > "Typically the fix for an issue that is affecting supported releases
> > > > lands on the master branch and is then backported to the release
> > > > branch(es). In rare cases, the fix might directly go into a release
> > > > branch without landing on master (e.g., fix / issue is not applicable
> > > > to master)." [1]
> > > >
> > > > This leaves room for interpretation about what lies outside of
> > > > "typical". Here's the simplest way I can explain what I stick to, and
> > > > I'd like to hear what others have in mind:
> > > >
> > > > * By default, bug fixes at any level should be backported to existing
> > > > release branches if it affects those releases. Especially important:
> > > > crashes, bugs in non-experimental features.
> > > >
> > > > * Exceptional cases that can omit backporting: difficult to backport
> > > > fixes (especially if the bugs are deemed of low priority), bugs in
> > > > experimental features.
> > > >
> > > > * Exceptional non-bug cases that can be backported: performance
> > > > improvements.
> > > >
> > > > I realize that there is a ton of subtlety here (even in terms of which
> > > > things are defined as bugs). But I hope we can lay down a policy that
> > > > gives everyone the right mindset for common cases and then discuss
> > > > corner cases on-demand in the future.
> > > >
> > > > [1] http://mesos.apache.org/documentation/latest/versioning/
> > > >
> > >
> >
>
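
The three criteria proposed in the 2018-07-12 message quoted above can be read
as a small decision procedure. The C++ sketch below only illustrates that
reading; the type and function names (FixKind, Decision, isRegression,
backportDecision) are hypothetical and do not exist in Mesos or its tooling.

```cpp
// Hypothetical classification of a change; illustrative names only.
enum class FixKind { Regression, CriticalOrBlocker, OtherBugOrImprovement };

// Hypothetical outcomes, mirroring the wording of the three criteria.
enum class Decision { BackportByDefault, CanBackport, ReleaseManagerDecides };

// A bug is considered a regression if the functionality is present in the
// previous minor or patch version and is not affected by the bug there.
bool isRegression(bool presentInPreviousVersion, bool affectedInPreviousVersion)
{
  return presentInPreviousVersion && !affectedInPreviousVersion;
}

Decision backportDecision(FixKind kind)
{
  switch (kind) {
    case FixKind::Regression:
      return Decision::BackportByDefault;
    case FixKind::CriticalOrBlocker:
      return Decision::CanBackport;  // e.g., a CVE
    case FixKind::OtherBugOrImprovement:
      return Decision::ReleaseManagerDecides;
  }

  // Not reachable for valid enum values; kept to satisfy compilers.
  return Decision::ReleaseManagerDecides;
}
```

The sketch leaves out the alternative discussed in the thread, where the
committer rather than the release manager makes the call in the third case.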


Re: Proposing change to the allocatable check in the allocator

2018-06-12 Thread Alex Rukletsov
Instead of a master flag, why not a master API call? This would allow
updating the value without restarting the master.

Another thought is that we should explain to operators how and when to use
this knob. For example, if they observe a behavioural pattern A, then it
means B is happening, and tuning the knob to C might help.

On Tue, Jun 12, 2018 at 7:36 AM, Jie Yu  wrote:

> I would suggest we also consider the possibility of adding per framework
> control on `min_allocatable_resources`.
>
> If we want to consider supporting per-framework setting, we should probably
> model this as a protobuf, rather than a free form JSON. The same protobuf
> can be reused for both master flag, framework API, or even supporting
> Resource Request in the future. Something like the following:
>
> // Field numbers are illustrative.
> message ResourceQuantityPredicate {
>   enum Type {
>     SCALAR_GE = 1;
>   }
>   optional Type type = 1;
>   optional Value.Scalar scalar = 2;
> }
>
> message ResourceRequirement {
>   required string resource_name = 1;
>   oneof predicates {
>     ResourceQuantityPredicate quantity = 2;
>   }
> }
>
> message ResourceRequirementList {
>   // All requirements MUST be met.
>   repeated ResourceRequirement requirements = 1;
> }
>
> // Resource request API.
> message Request {
>   repeated ResourceRequirementList accepted = 1;
> }
>
> // `allocatable()`
> message MinimalAllocatableResources {
>   repeated ResourceRequirementList accepted = 1;
> }
>
> On Mon, Jun 11, 2018 at 3:47 PM, Meng Zhu  wrote:
>
> > Hi:
> >
> > The allocatable check (allocator/mesos/hierarchical.cpp#L2471-L2479) in
> > the allocator (shown below) was originally introduced to help alleviate
> > the situation where a framework receives some resources, but no
> > cpu/memory, thus cannot launch a task.
> >
> > constexpr double MIN_CPUS = 0.01;
> > constexpr Bytes MIN_MEM = Megabytes(32);
> >
> > bool HierarchicalAllocatorProcess::allocatable(
> >     const Resources& resources)
> > {
> >   Option<double> cpus = resources.cpus();
> >   Option<Bytes> mem = resources.mem();
> >
> >   return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
> >          (mem.isSome() && mem.get() >= MIN_MEM);
> > }
> >
> >
> > Issues
> >
> > However, there have been a couple of issues surfacing lately surrounding
> > the check.
> >
> > - MESOS-8935 Quota limit "chopping" can lead to cpu-only and memory-only
> >   offers.
> >
> > We introduced fine-grained quota-allocation (MESOS-7099) in Mesos 1.5.
> > When we allocate resources to a role, we'll "chop" the available resources
> > of the agent up to the quota limit for the role. However, this has the
> > unintended consequence of creating cpu-only and memory-only offers, even
> > though there might be other agents with both cpu and memory resources
> > available in the cluster.
> >
> > - MESOS-8626 The 'allocatable' check in the allocator is problematic with
> >   multi-role frameworks.
> >
> > Consider roleA reserved cpu/memory on an agent and roleB reserved disk on
> > the same agent. A framework under both roleA and roleB will not be able to
> > get the reserved disk due to the allocatable check. With the introduction
> > of resource providers, similar situations will become more common.
> >
> > Proposed change
> >
> > Instead of hardcoding a one-size-fits-all value in Mesos, we are
> > proposing to add a new master flag, min_allocatable_resources. It
> > specifies one or more scalar resource quantities that define the minimum
> > allocatable resources for the allocator. The allocator will only offer
> > resources that contain at least one of the specified quantities. The
> > default behavior *is backward compatible*, i.e. by default the flag is
> > set to “cpus:0.01|mem:32”.
> >
> > Usage
> >
> > The flag takes either simple text listing resource(s) delimited by a bar
> > (|) or a JSON array of JSON-formatted resources. Note that the input
> > should be “pure” scalar quantities, i.e. the specified resource(s) should
> > only have the name, type (set to scalar) and scalar fields set.
> >
> >
> > Examples:
> >
> > - To eliminate cpu-only or memory-only offers due to quota chopping, we
> >   could set the flag to “cpus:0.01;mem:32”.
> >
> > - To enable disk-only offers, we could set the flag to “disk:32”.
> >
> > - For both, we could set the flag to “cpus:0.01;mem:32|disk:32”. Then the
> >   allocator will only offer resources that at least contain
> >   “cpus:0.01;mem:32” OR resources that at least contain “disk:32”.
> >
> >
> > Let me know what you think! Thanks!
> >
> >
> > -Meng
> >
> >
>
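
To make the flag semantics in the proposal concrete, here is a minimal C++
sketch of how the simple text form could be interpreted. It assumes that '|'
separates alternatives and ';' separates quantities that must all be present,
as in the examples in the proposal. The names Quantities, parseMinAllocatable
and allocatable are hypothetical stand-ins and not the Mesos implementation;
amounts are plain doubles (mem and disk in MB), and treating an empty flag as
"no minimum" is an assumption of this sketch.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration; this is NOT a Mesos type.
// Maps a scalar resource name to an amount (mem and disk in MB).
using Quantities = std::map<std::string, double>;

// Splits `s` on `delim`, skipping empty pieces.
std::vector<std::string> split(const std::string& s, char delim)
{
  std::vector<std::string> parts;
  std::stringstream stream(s);
  std::string part;
  while (std::getline(stream, part, delim)) {
    if (!part.empty()) {
      parts.push_back(part);
    }
  }
  return parts;
}

// Parses the simple text form, e.g. "cpus:0.01;mem:32|disk:32".
// '|' separates alternatives; ';' separates quantities that must all hold.
// Entries without a ':' are skipped; std::stod throws on non-numeric amounts.
std::vector<Quantities> parseMinAllocatable(const std::string& flag)
{
  std::vector<Quantities> alternatives;
  for (const std::string& alternative : split(flag, '|')) {
    Quantities quantities;
    for (const std::string& entry : split(alternative, ';')) {
      const std::string::size_type colon = entry.find(':');
      if (colon == std::string::npos) {
        continue;
      }
      quantities[entry.substr(0, colon)] = std::stod(entry.substr(colon + 1));
    }
    if (!quantities.empty()) {
      alternatives.push_back(quantities);
    }
  }
  return alternatives;
}

// Returns true if `resources` fully satisfies at least one alternative.
// Assumption of this sketch: an empty list imposes no minimum.
bool allocatable(
    const Quantities& resources,
    const std::vector<Quantities>& minAllocatable)
{
  if (minAllocatable.empty()) {
    return true;
  }
  for (const Quantities& alternative : minAllocatable) {
    bool satisfied = true;
    for (const auto& required : alternative) {
      const auto found = resources.find(required.first);
      if (found == resources.end() || found->second < required.second) {
        satisfied = false;
        break;
      }
    }
    if (satisfied) {
      return true;
    }
  }
  return false;
}
```

With this reading, a cpu-only offer passes under "cpus:0.01|mem:32" but is
rejected under "cpus:0.01;mem:32", while a disk-only offer of 64 MB passes
under "cpus:0.01;mem:32|disk:32".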


Re: Update the *Minimum Linux Kernel version* supported on Mesos

2018-04-08 Thread Alex Rukletsov
This does not seem like a disruptive change to me, so I'm +1.

On Thu, Apr 5, 2018 at 6:36 PM, Jie Yu  wrote:

>> User namespaces require >= 3.12 (November 2013). Can we make that the
>> minimum?
>
>
> No, we need to support CentOS7 which uses 3.10 (some variant)
>
> - Jie
>
> On Thu, Apr 5, 2018 at 8:56 AM, James Peach  wrote:
>
>>
>>
>> > On Apr 5, 2018, at 5:00 AM, Andrei Budnik wrote:
>> >
>> > Hi All,
>> >
>> > We would like to update the minimum supported Linux kernel from 2.6.23
>> > to 2.6.28. The Linux kernel supports cgroups v1 starting from 2.6.24,
>> > but `freezer` cgroup functionality was merged into 2.6.28, which
>> > supports nested containers.
>>
>> User namespaces require >= 3.12 (November 2013). Can we make that the
>> minimum?
>>
>> J
>
>
>
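
For illustration, a minimal C++ sketch of comparing a kernel release string
against the minimum versions discussed in this thread. The helper names
parseKernelVersion and atLeast are hypothetical and not Mesos functions; the
input is assumed to look like the output of `uname -r`, and components that
cannot be parsed are treated as 0.

```cpp
#include <cstdio>
#include <string>
#include <tuple>

// (major, minor, patch); std::tuple compares lexicographically.
using KernelVersion = std::tuple<int, int, int>;

// Parses the leading "major.minor.patch" of a release string such as
// "3.10.0-693.el7.x86_64". Components that are missing stay at 0.
KernelVersion parseKernelVersion(const std::string& release)
{
  int versionMajor = 0;
  int versionMinor = 0;
  int versionPatch = 0;
  std::sscanf(
      release.c_str(),
      "%d.%d.%d",
      &versionMajor,
      &versionMinor,
      &versionPatch);
  return KernelVersion(versionMajor, versionMinor, versionPatch);
}

// Returns true if `release` is at least `minimum`.
bool atLeast(const std::string& release, const KernelVersion& minimum)
{
  return parseKernelVersion(release) >= minimum;
}
```

Under these assumptions, a 3.10-based kernel satisfies the proposed 2.6.28
minimum but not the 3.12 needed for user namespaces, which is the trade-off
raised above.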


Re: Release policy and 1.6 release schedule

2018-03-26 Thread Alex Rukletsov
I would like us to do monthly releases and support 10 branches at a time.
Ideally, releasing that often reduces the burden for the release manager,
because there are fewer changes and fewer new features. However, we lack
automation to support this pace: our release guide [1] is several pages
long and includes quite a few non-trivial steps. It would be great to find
some time (maybe during the next Mesos hackathon?) and revisit our release
procedures, but until then I'm +1 for quarterly.

[1] https://mesos.apache.org/documentation/latest/release-guide/

On Sat, Mar 24, 2018 at 5:48 AM, Vinod Kone  wrote:

> I’m +1 for quarterly.
>
> Most importantly I want us to adhere to a predictable cadence.
>
> Sent from my phone
>
> On Mar 23, 2018, at 9:21 PM, Jie Yu  wrote:
>
> Supporting multiple releases is a burden.
>
> 1.2 was released March, 2017 (1 year ago), and I know that some users are
> still on that version.
> 1.3 was released June, 2017 (9 months ago), and we're still maintaining it
> (patches were still being backported several days ago, at the request of
> some users).
> 1.4 was released Sept, 2017 (6 months ago).
> 1.5 was released Feb, 2018 (1 month ago).
>
> As you can see, users expect a release to be supported 6-9 months (e.g.,
> backports are still needed for the 1.3 release, which is 9 months old). If
> we were to do monthly minor releases, we'd probably need to maintain 6-9
> release branches. That's too much of an ask for committers and maintainers.
>
> I also agree with folks that there are benefits to doing releases more
> frequently. Given the historical data, I'd suggest we do quarterly
> releases, and maintain three release branches.
>
> - Jie
>
> On Fri, Mar 23, 2018 at 10:03 AM, Greg Mann  wrote:
>
>> The best motivation I can think of for a shorter release cycle is this: if
>> the release cadence is fast enough, then developers will be less likely to
>> rush a feature into a release. I think this would be a real benefit, since
>> rushing features in hurts stability. *However*, I'm not sure if every two
>> months is fast enough to bring this benefit. I would imagine that a
>> two-month wait is still long enough that people wouldn't want to wait an
>> entire release cycle to land their feature. Just off the top of my head, I
>> might guess that a release cadence of 1 month or shorter would be often
>> enough that it would always seem reasonable for a developer to wait until
>> the next release to land a feature. What do y'all think?
>>
>> Other motivating factors that have been raised are:
>> 1) Many users upgrade on a longer timescale than every ~2 months. I think
>> that this doesn't need to affect our decision regarding release timing -
>> since we guarantee compatibility of all releases with the same major
>> version number, there is no reason that a user needs to upgrade minor
>> releases one at a time. It's fine to go from 1.N to 1.(N+3), for example.
>> 2) Backporting will be a burden if releases are too short. I think that in
>> practice, backporting will not take too much longer. If there was a
>> conflict back in the tree somewhere, then it's likely that after resolving
>> that conflict once, the same diff can be used to backport the change to
>> previous releases as well.
>> 3) Adhering strictly to a time-based release schedule will help users plan
>> their deployments, since they'll be able to rely on features being
>> released on-schedule. However, if we do strict time-based releases, then
>> it will be less certain that a particular feature will land in a
>> particular release, and users may have to wait a release cycle to get the
>> feature.
>>
>> Personally, I find the idea of preventing features from being rushed into
>> a release very compelling. From that perspective, I would love to see
>> releases every month. However, if we're not going to release that often,
>> then I think it does make sense to adjust our release schedule to
>> accommodate the features that community members want to land in a
>> particular release.
>>
>>
>> Jie, I'm curious why you suggest a *minimal* interval between releases.
>> Could you elaborate a bit on your motivations there?
>>
>> Cheers,
>> Greg
>>
>>
>> On Fri, Mar 16, 2018 at 2:01 PM, Jie Yu  wrote:
>>
>> > Thanks Greg for starting this thread!
>> >
>> >
>> >> My primary motivation here is to bring our documented policy in line
>> >> with our practice, whatever that may be
>> >
>> >
>> > +100
>> >
>> > Do people think that we should attempt to bring our release cadence more
>> >> in line with our current stated policy, or should the policy be changed
>> >> to reflect our current practice?
>> >
>> >
>> > I think a minor release every 2 months is probably too aggressive. I
>> > don't have concrete data, but my feeling is that the frequency that
>> > folks upgrade Mesos is low. I know 

Re: Mesos 1.5.0 Release

2017-12-22 Thread Alex Rukletsov
https://issues.apache.org/jira/browse/MESOS-8297 has just landed. Let's
include it in 1.5.0 as well.

On Fri, Dec 22, 2017 at 4:35 AM, Jie Yu  wrote:

> Yeah, I am doing a grooming right now.
>
> Sent from my iPhone
>
> > On Dec 21, 2017, at 7:25 PM, Benjamin Mahler  wrote:
> >
> > Meng is working on https://issues.apache.org/jira/browse/MESOS-8352 and
> > we should land it tonight if not tomorrow. I can cherry pick if it's
> > after your cut, and worst case it can go in 1.5.1.
> >
> > Have you guys gone over the unresolved items targeted for 1.5.0? I see a
> > lot of stuff, might be good to start adjusting / removing their target
> > versions to give folks a chance to respond on the ticket?
> >
> > https://issues.apache.org/jira/issues/?jql=project%20%3D%20MESOS%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reviewable%2C%20Accepted)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0
> >
> > For example, https://issues.apache.org/jira/browse/MESOS-8337 looks
> > pretty bad to me (master crash).
> >
> >> On Thu, Dec 21, 2017 at 7:00 PM, Jie Yu  wrote:
> >>
> >> Hi,
> >>
> >> We're about to cut 1.5.0-rc1 tomorrow. If you have anything that needs
> >> to go into 1.5.0 that hasn't landed, please let me or Gilbert know asap.
> >> Thanks!
> >>
> >> - Jie
> >>
> >>> On Fri, Dec 1, 2017 at 3:58 PM, Gilbert Song wrote:
> >>>
> >>> Folks,
> >>>
> >>> It is time for Mesos 1.5.0 release. I am the release manager.
> >>>
> >>> We plan to cut the rc1 in the next couple of weeks. Please start to
> >>> wrap up patches if you are contributing or shepherding any issue. If
> >>> you expect any particular JIRA for this new release, please set
> >>> *Target Version* as "*1.5.0*" and mark it as "*Blocker*" priority.
> >>>
> >>> The dashboard for Mesos 1.5.0 will be posted in this thread soon.
> >>>
> >>> Cheers,
> >>> Gilbert
> >>>
> >>
> >>
>


[RESULT][VOTE] Release Apache Mesos 1.1.3 (rc2)

2017-08-31 Thread Alex Rukletsov
Hi all,

The vote for Mesos 1.1.3 (rc2) has passed with the
following votes.

+1 (Binding)
--
Alex R
Till Tönshoff
Vinod Kone

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.3

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.3

The mesos-1.1.3.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Till & Alex


Re: [VOTE] Release Apache Mesos 1.1.3 (rc2)

2017-08-31 Thread Alex Rukletsov
+1

Tested on internal CI and additionally `make check` on Fedora 25 and Mac OS
10.11.6.

On Thu, Aug 31, 2017 at 2:50 AM, Till Toenshoff <toensh...@me.com> wrote:

> +1
>
> Tested on internal CI as well as on macOS 10.12 and macOS 10.13 DP 8 using
> Apple’s clang (Xcode 8.3.3 and Xcode 9.0.0 beta 6).
>
> > On Aug 27, 2017, at 8:33 PM, Vinod Kone <vinodk...@apache.org> wrote:
> >
> > +1 (binding)
> >
> > Tested on ASF CI. The only red build was the known perf core dump issue.
> >
> > Revision: ce77d91bd3a59227d5684ce0783b460c54ea311f
> > refs/tags/1.1.3-rc2
> > Configuration matrix (ASF CI job Mesos-Release, build 40,
> > https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/40/).
> > All builds ran with GLOG_v=1 MESOS_VERBOSE=1.
> >
> > OS            Configuration                             Build tool  Compilers
> > centos:7      --verbose --enable-libevent --enable-ssl  autotools   gcc
> > centos:7      --verbose --enable-libevent --enable-ssl  cmake       gcc
> > centos:7      --verbose                                 autotools   gcc
> > centos:7      --verbose                                 cmake       gcc
> > ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools   gcc, clang
> > ubuntu:14.04  --verbose --enable-libevent --enable-ssl  cmake       gcc, clang
> > ubuntu:14.04  --verbose                                 autotools   gcc, clang
> > ubuntu:14.04  --verbose                                 cmake       gcc, clang
> >
> > On Fri, Aug 25, 2017 at 7:48 AM, Alex Rukletsov <a...@mesosphere.com> wrote:
> > Folks,
> >
> > Please vote on releasing the fo

[VOTE] Release Apache Mesos 1.1.3 (rc2)

2017-08-25 Thread Alex Rukletsov
Folks,

Please vote on releasing the following candidate as Apache Mesos 1.1.3.
Note that this will be the last 1.1.x release.

1.1.3 includes the following:

** Bug
  * [MESOS-5187] - The filesystem/linux isolator does not set the
permissions of the host_path.
  * [MESOS-6743] - Docker executor hangs forever if `docker stop` fails.
  * [MESOS-6950] - Launching two tasks with the same Docker image
simultaneously may cause a staging dir never cleaned up.
  * [MESOS-7540] - Add an agent flag for executor re-registration timeout.
  * [MESOS-7569] - Allow "old" executors with half-open connections to be
preserved during agent upgrade / restart.
  * [MESOS-7689] - Libprocess can crash on malformed request paths for
libprocess messages.
  * [MESOS-7690] - The agent can crash when an unknown executor tries to
register.
  * [MESOS-7581] - Fix interference of external Boost installations when
using some unbundled dependencies.
  * [MESOS-7703] - Mesos fails to exec a custom executor when no shell is
used.
  * [MESOS-7728] - Java HTTP adapter crashes JVM when leading master
disconnects.
  * [MESOS-7770] - Persistent volume might not be mounted if there is a
sandbox volume whose source is the same as the target of the persistent
volume.
  * [MESOS-] - Agent failed to recover due to mount namespace leakage
in Docker 1.12/1.13.
  * [MESOS-7796] - LIBPROCESS_IP isn't passed on to the fetcher.
  * [MESOS-7830] - Sandbox_path volume does not have ownership set
correctly.
  * [MESOS-7863] - Agent may drop pending kill task status updates.
  * [MESOS-7865] - Agent may process a kill task and still launch the task.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.3-rc2


The candidate for Mesos 1.1.3 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.3-rc2/mesos-1.1.3.tar.gz

The tag to be voted on is 1.1.3-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.3-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.3-rc2/mesos-1.1.3.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.3-rc2/mesos-1.1.3.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1208

Please vote on releasing this package as Apache Mesos 1.1.3!

The vote is open until Wed Aug 28 23:59:59 CEST 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.3
[ ] -1 Do not release this package because ...

Thanks,
Alex & Till


Re: Mesos 1.1.3 release

2017-08-17 Thread Alex Rukletsov
We have two more issues that I would like to have in 1.1.3 because it's the
last 1.1.x release:
https://issues.apache.org/jira/browse/MESOS-7865
https://issues.apache.org/jira/browse/MESOS-7863

They are in review and will be backported soon.

On Tue, Jul 25, 2017 at 11:28 AM, Alex Rukletsov <a...@mesosphere.com>
wrote:

> MESOS-7643 is still unresolved. I am moving the cut date for one more
> week, because this is the last patch release for 1.1.x.
>
> On Fri, Jul 14, 2017 at 6:34 PM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Folks,
>>
>> We are planning to cut the 1.1.3 release once MESOS-7643 is resolved. If
>> you have any patch that needs to get into 1.1.3, please make sure that
>> either it is already in the 1.1.x branch or the corresponding ticket has
>> a target version including 1.1.3.
>>
>> The release dashboard:
>> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12331463
>>
>> Till & Alex.
>>
>> On Wed, Jun 14, 2017 at 12:59 PM, Alex Rukletsov <a...@mesosphere.com>
>> wrote:
>>
>>> Folks,
>>>
>>> there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
>>> and MESOS-7569). Since this will be the last 1.1.x release, we are
>>> delaying it for 3 more weeks to leave more time for people to include
>>> critical bug fixes.
>>>
>>> Till & Alex.
>>>
>>
>>
>


Re: Mesos 1.1.3 release

2017-07-25 Thread Alex Rukletsov
MESOS-7643 is still unresolved. I am moving the cut date for one more week,
because this is the last patch release for 1.1.x.

On Fri, Jul 14, 2017 at 6:34 PM, Alex Rukletsov <a...@mesosphere.com> wrote:

> Folks,
>
> We are planning to cut the 1.1.3 release once MESOS-7643 is resolved. If
> you have any patch that needs to get into 1.1.3, please make sure that
> either it is already in the 1.1.x branch or the corresponding ticket has
> a target version including 1.1.3.
>
> The release dashboard:
> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12331463
>
> Till & Alex.
>
> On Wed, Jun 14, 2017 at 12:59 PM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Folks,
>>
>> there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
>> and MESOS-7569). Since this will be the last 1.1.x release, we are
>> delaying it for 3 more weeks to leave more time for people to include
>> critical bug fixes.
>>
>> Till & Alex.
>>
>
>


Re: Mesos 1.1.3 release

2017-07-14 Thread Alex Rukletsov
Folks,

We are planning to cut the 1.1.3 release once MESOS-7643 is resolved. If
you have any patch that needs to get into 1.1.3, please make sure that
either it is already in the 1.1.x branch or the corresponding ticket has a
target version including 1.1.3.

The release dashboard:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12331463

Till & Alex.

On Wed, Jun 14, 2017 at 12:59 PM, Alex Rukletsov <a...@mesosphere.com>
wrote:

> Folks,
>
> there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
> and MESOS-7569). Since this will be the last 1.1.x release, we are
> delaying it for 3 more weeks to leave more time for people to include
> critical bug fixes.
>
> Till & Alex.
>


Re: Executors and CPU allocations

2017-06-26 Thread Alex Rukletsov
Regarding your second idea, you may have a "dummy" task with, say, 1.8 CPU
and "run" it iff there is at least another real task running, while
assigning 0.1 CPU for your executor. You can do some bookkeeping in the
executor to determine whether a certain executor is idle (and hence a
"dummy" task should be sent) when accepting an offer. This might be racy,
so you may want to "kill" the dummy task after a certain timeout on the
executor.

Similar to the above, you can also terminate executors from the scheduler
if you don't need them any more, or for a certain period of time.
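
The bookkeeping could look roughly like the C++ sketch below. It follows the
"iff there is at least another real task running" reading: the dummy task
holds the extra CPU while the executor is busy and is released once it is
idle. ExecutorState, Action, reconcile and cpusHeld are hypothetical names for
illustration and not part of any Mesos API; the 0.1 and 1.8 CPU figures are
the ones from the paragraph above.

```cpp
// Hypothetical per-executor bookkeeping kept by the scheduler.
struct ExecutorState
{
  int realTasks = 0;         // Real tasks currently running on the executor.
  bool dummyRunning = false; // Whether the CPU-holding dummy task is running.
};

constexpr double EXECUTOR_CPUS = 0.1; // CPU assigned to the executor itself.
constexpr double DUMMY_CPUS = 1.8;    // CPU held by the dummy task.

enum class Action { NONE, LAUNCH_DUMMY, KILL_DUMMY };

// Decides what to do with the dummy task so that it runs if and only if
// at least one real task is running.
Action reconcile(const ExecutorState& state)
{
  const bool busy = state.realTasks > 0;
  if (busy && !state.dummyRunning) {
    return Action::LAUNCH_DUMMY;
  }
  if (!busy && state.dummyRunning) {
    return Action::KILL_DUMMY;
  }
  return Action::NONE;
}

// CPU currently accounted to the executor, including the dummy task.
double cpusHeld(const ExecutorState& state)
{
  return EXECUTOR_CPUS + (state.dummyRunning ? DUMMY_CPUS : 0.0);
}
```

A scheduler could call reconcile() when accepting an offer and again on task
status updates or after a timeout, which is where the racy window mentioned
above would need handling.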

On Mon, Jun 19, 2017 at 4:09 AM, Christopher Hunt <
christopher.h...@lightbend.com> wrote:

> Hi there,
>
> We have a framework that runs on Mesos and DC/OS. There is a core and an
> agent design to our framework which equates to a Mesos scheduler and
> executor respectively. The executor is responsible for forking and managing
> processes w.r.t. to our problem domain. Given that the executor is written
> in Scala and runs on the JVM, we find that it requires at least 1.9 CPUs to
> be allocated in order to function reasonably well. Also, given that it is a
> JVM process we also “warm up” the executors by starting them for each
> distinct node that we receive offers for. This keeps our domain of task
> management feeling responsive.
>
> Our problem is that our executor will consume 1.9 CPUs even when we have
> no further tasks. Given that Mesos deducts 1.9 from the number of
> available CPUs on each node, our users quickly complain that there’s no
> resource left to run anything else.
>
> I’m hoping to solicit ideas on how we can manage our executor more
> effectively. Clearly, consuming 1.9 cpus when effectively doing nothing is
> undesirable.
>
> Some ideas:
>
> * start the executor only when required - we tried this and the resulting
> experience felt sluggish given the overhead of starting the JVM based
> executor
> * start the executor with fewer CPU requirements (say, 1.0 CPUs), and then
> change its CPU share via ExecutorInfo when we have tasks to run - I’m not
> sure that this is possible - I think Mesos complains if ExecutorInfo is
> changed given that a previous task has supplied it
> * Given Mesos 1.3 and its support for multiple roles, have our framework
> register its own role so that the user has more control over where our
> executors are placed - at present we target all nodes where we receive an
> offer i.e. “*”.
> * re-write the executor off the JVM e.g. using Rust - this would be
> non-trivial
>
> Thoughts/more ideas?
>
> Thanks in advance.
>
> Kind regards,
> Christopher
>
> Christopher Hunt
> *Technical Lead, Lightbend Enterprise Suite*
> @huntchr
> UTC+10
>
>


On Apache Mesos release process

2017-06-17 Thread Alex Rukletsov
Folks,

for more than a year, Apache Mesos releases have been done according to our
"then new" release policy [1]. It seems to work quite well, but today I would
like to address things that can be improved.

Let's start with pain points:
* A minor bug can cancel a release vote, even for a patch release.
* More canceled votes lead to more RCs and hence create more work for
committers and voters.
* Little motivation to vote on a release candidate unless other people vote.
* Releases often run behind schedule.

I would like to suggest some improvements to the process:

1. Stricter time-based releases. The next release should go into planning
(with release managers elected) right after the current one is cut. Feature
owners work with the release managers prior to the cut to track progress (the
k8s community aims for 2-3 meetings per week discussing blockers and
schedule). This way release managers should have a satisfactory understanding
of which new features are going in and what can slow down the release several
days before the cut.

2. Written guideline for which issues can '-1' the release. Though it is up
to the voter how to vote, a clear guideline will set reasonable
expectations and hopefully help us decrease the number of RCs. Regressions
(security, performance, compatibility, functional) can cause -1.
Regressions of experimental features cannot cause -1. Patch releases can be
-1'd in exceptional cases, e.g., critical bug fix missing in the last patch
release. New features cannot block a release.

Note: We love reasonable -1 votes! It is so much better to defer a release
than discover a critical regression from a production user report!

3. Release managers decide what is backported to the RC branch once it is
cut (same for patch releases). Feature owners and committers are encouraged
to update the release managers in a timely manner on the status and
importance of features and bug fixes.

And of course, I encourage everyone using Mesos to test & vote on release
candidates! Identical cluster configurations are rare, so each new setup
helps with finding bugs and hence with building better software.

[1] https://github.com/apache/mesos/blob/master/docs/versioning.md

Alex.


Mesos 1.1.3 release

2017-06-14 Thread Alex Rukletsov
Folks,

there are only 2 back ported tickets to the 1.1.x branch so far (MESOS-7540
and MESOS-7569). Since this will be the last 1.1.x release, we are delaying
it for 3 more weeks to leave more time for people to include critical bug
fixes.

Till & Alex.


Re: [VOTE] Release Apache Mesos 1.2.1 (rc1)

2017-06-12 Thread Alex Rukletsov
PortMapping tests are indeed in bad shape. There are JIRAs already, have a
look before filing new ones:
MESOS-4646, MESOS-5687, MESOS-2765, MESOS-5690, MESOS-5688, MESOS-5689,
MESOS-4643, MESOS-4644, MESOS-5309

On Sat, Jun 10, 2017 at 10:58 AM, Adam Bordelon  wrote:

> +1 (binding) Good enough for me.
>
> Ran `make check` (or equivalent) on the Mesosphere internal Jenkins CI.
> Lots of green (all tests passed) on Mac, CentOS7, Debian8, Fedora23 and
> Ubuntu 12.04.
> Three sets of yellow configs yielded 10 unique but mostly known
> failing/flaky tests.
> (Grey means untested)
> [image: Inline image 1]
>
> * Ubuntu {14.04|16.04|16.10} - {Plain|SSL|CMake|Clang}
>   PerfTest.Version (always) -
>   https://issues.apache.org/jira/browse/MESOS-7160
>   ExamplesTest.PythonFramework (sometimes) -
>   https://issues.apache.org/jira/browse/MESOS-7218
>
> * Centos 6 - {Plain|SSL}
>   DockerContainerizerTest.ROOT_DOCKER_LaunchWithPersistentVolumes -
>   https://issues.apache.org/jira/browse/MESOS-7510
>
> * Fedora 23 - Network_Isolator
>   PortMappingIsolatorTest.ROOT_NC_HostToContainerUDP -
>   https://issues.apache.org/jira/browse/MESOS-5690
>   PortMappingIsolatorTest.ROOT_ContainerICMPExternal -
>   https://issues.apache.org/jira/browse/MESOS-5689
>   PortMappingIsolatorTest.ROOT_DNS -
>   https://issues.apache.org/jira/browse/MESOS-5688
>   PortMappingIsolatorTest.ROOT_NC_SmallEgressLimit -
>   https://issues.apache.org/jira/browse/MESOS-5687
>   PortMappingIsolatorTest.ROOT_NC_PortMappingStatistics - ?
>   PortMappingMesosTest.CGROUPS_ROOT_RecoverMixedContainers - ?
>   PortMappingMesosTest.CGROUPS_ROOT_RecoverMixedKnownAndUnKnownOrphans - ?
>
> Anybody have any ideas on the last three? Seems like these PortMapping
> tests are generally in bad shape, or the network isolator is seriously
> broken. I'll file JIRAs.
>
> P.S. AgentAPIStreamingTest.AttachInputToNestedContainerSession, which Vinod
> saw on ASF CI, is flaky according to
> https://issues.apache.org/jira/browse/MESOS-7159 (added the log gist link
> there).
>
> P.P.S. CI results at
> https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/1215
> for those with access. We're still working on exposing our CI to the
> public. Waiting is.
>
>
> On Thu, Jun 8, 2017 at 4:23 PM, Benjamin Mahler wrote:
>
>> +1 (binding)
>>
>> make check passed on macOS 10.12.4
>>
>> The ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession
>> passed for me. Kevin, I captured the logs to the failed run vinod pointed
>> to here:
>>
>> https://gist.github.com/bmahler/5ae340b4de3341f3c1f072250006dc64
>>
>> Does that look like a flaky test or a bug?
>>
>> On Thu, Jun 8, 2017 at 4:07 PM, Benjamin Mahler wrote:
>>
>>> Vinod I think that's the getenv issue from:
>>> https://issues.apache.org/jira/browse/MESOS-6985
>>>
>>> On Wed, May 17, 2017 at 5:57 PM, Till Toenshoff wrote:
>>>
 +1

 Ran it through DC/OS builds and integration tests;
 https://github.com/dcos/dcos/pull/1530 => all green

 On May 17, 2017, at 10:01 PM, Vinod Kone  wrote:

 Ran it on ASF CI and saw some issues.

 Segfault in "MasterTest.MultipleExecutors" in two builds [1] [2], which is
 concerning. Is this a known issue?

 "ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession"
 test failed.



 On Sun, May 14, 2017 at 12:55 AM, tommy xiao  wrote:

> +1
>
> 2017-05-12 7:33 GMT+08:00 Adam Bordelon :
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.2.1.
> >
> > 1.2.1 is a bug fix release. The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.2.1-rc1
> >
> > The candidate for Mesos 1.2.1 release is available at:
> > https://dist.apache.org/repos/dist/dev/mesos/1.2.1-rc1/mesos-1.2.1.tar.gz
> >
> > The tag to be voted on is 

[RESULT][VOTE] Release Apache Mesos 1.1.2 (rc2)

2017-05-19 Thread Alex Rukletsov
Hi all,

The vote for Mesos 1.1.2 (rc2) has passed with the following votes.

+1 (Binding)
--
Vinod Kone
Till Tönshoff
Alex Rukletsov

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.2

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2

The mesos-1.1.2.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Alex & Till


[VOTE] Release Apache Mesos 1.1.2 (rc2)

2017-05-12 Thread Alex Rukletsov
Folks,

Please vote on releasing the following candidate as Apache Mesos 1.1.2.

1.1.2 includes the following:

** Bug
  * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
  * [MESOS-5028] - Copy provisioner cannot replace directory with symlink.
  * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
Redirect 3xx urls.
  * [MESOS-6327] - Large docker images causes container launch failures:
Too many levels of symbolic links.
  * [MESOS-7057] - Consider using the relink functionality of libprocess in
the executor driver.
  * [MESOS-7119] - Mesos master crash while accepting inverse offer.
  * [MESOS-7152] - The agent may be flapping after the machine reboots due
to provisioner recover.
  * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
  * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
--docker_mesos_image.
  * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
argument" error.
  * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
into sandbox logs.
  * [MESOS-7350] - Failed to pull image from Nexus Registry due to
signature missing.
  * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
persistent volume content.
  * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
  * [MESOS-7422] - Docker containerizer should not leak possibly sensitive
data to agent log.
  * [MESOS-7471] - Provisioner recover should not always assume 'rootfses'
dir exists.
  * [MESOS-7482] - #elif does not match #ifdef when checking the platform.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc2


The candidate for Mesos 1.1.2 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc2/mesos-1.1.2.tar.gz

The tag to be voted on is 1.1.2-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc2/mesos-1.1.2.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc2/mesos-1.1.2.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1194

Please vote on releasing this package as Apache Mesos 1.1.2!

The vote is open until Wed May 17 17:17:17 CEST 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.2
[ ] -1 Do not release this package because ...

Thanks,
Till & Alex
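
For anyone checking the candidate before voting, the MD5 comparison mentioned above can be sketched in Python as follows. This is a minimal sketch, not part of the release tooling: the helper names `md5_hex` and `checksum_matches` are hypothetical, and it assumes you have already downloaded the tarball and copied the hex digest out of the published .md5 file by hand.

```python
import hashlib


def md5_hex(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, published_hex: str) -> bool:
    """Compare the local digest with a published one, ignoring case and padding."""
    return md5_hex(path) == published_hex.strip().lower()
```

A matching MD5 only shows that the download is not corrupted; the .asc signature together with the KEYS file is what ties the tarball to the release managers.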


Re: [VOTE] Release Apache Mesos 1.1.2 (rc1)

2017-05-12 Thread Alex Rukletsov
Vinod, the failure you've observed is a known flaky test:
https://issues.apache.org/jira/browse/MESOS-6724

MESOS-7471 <https://issues.apache.org/jira/browse/MESOS-7471> has been
backported. We don't have any other blockers, so I'll be cutting a new rc soon.

On Wed, May 10, 2017 at 6:03 PM, Alex Rukletsov <a...@mesosphere.io> wrote:

> This vote is cancelled. Vinod, I'll look into the failure and report back.
> After that, I'll start a new vote.
>
> On 9 May 2017 10:07 am, "Jie Yu" <yujie@gmail.com> wrote:
>
>> -1
>>
>> I suggest we include this fix in 1.1.2
>> https://issues.apache.org/jira/browse/MESOS-7471
>>
>> On Thu, May 4, 2017 at 12:07 PM, Alex Rukletsov <a...@mesosphere.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.1.2.
>>>
>>> 1.1.2 includes the following:
>>> 
>>> 
>>> ** Bug
>>>   * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
>>>   * [MESOS-5028] - Copy provisioner cannot replace directory with
>>> symlink.
>>>   * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
>>> Redirect 3xx urls.
>>>   * [MESOS-6327] - Large docker images causes container launch failures:
>>> Too many levels of symbolic links.
>>>   * [MESOS-7057] - Consider using the relink functionality of libprocess
>>> in
>>> the executor driver.
>>>   * [MESOS-7119] - Mesos master crash while accepting inverse offer.
>>>   * [MESOS-7152] - The agent may be flapping after the machine reboots
>>> due
>>> to provisioner recover.
>>>   * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
>>>   * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
>>> --docker_mesos_image.
>>>   * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
>>> argument" error.
>>>   * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
>>> into sandbox logs.
>>>   * [MESOS-7350] - Failed to pull image from Nexus Registry due to
>>> signature missing.
>>>   * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
>>> persistent volume content.
>>>   * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
>>>   * [MESOS-7422] - Docker containerizer should not leak possibly
>>> sensitive
>>> data to agent log.
>>>
>>> The CHANGELOG for the release is available at:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc1
>>> 
>>> 
>>>
>>> The candidate for Mesos 1.1.2 release is available at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz
>>>
>>> The tag to be voted on is 1.1.2-rc1:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc1
>>>
>>> The MD5 checksum of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.md5
>>>
>>> The signature of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.asc
>>>
>>> The PGP key used to sign the release is here:
>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>
>>> The JAR is up in Maven in a staging repository here:
>>> https://repository.apache.org/content/repositories/orgapachemesos-1188
>>>
>>> Please vote on releasing this package as Apache Mesos 1.1.2!
>>>
>>> The vote is open until Tue May 9 12:12:12 CEST 2017 and passes if a
>>> majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Mesos 1.1.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> Thanks,
>>> Alex & Till
>>>
>>
>>


Re: [VOTE] Release Apache Mesos 1.1.2 (rc1)

2017-05-10 Thread Alex Rukletsov
This vote is cancelled. Vinod, I'll look into the failure and report back.
After that, I'll start a new vote.

On 9 May 2017 10:07 am, "Jie Yu" <yujie@gmail.com> wrote:

> -1
>
> I suggest we include this fix in 1.1.2
> https://issues.apache.org/jira/browse/MESOS-7471
>
> On Thu, May 4, 2017 at 12:07 PM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.1.2.
>>
>> 1.1.2 includes the following:
>> 
>> 
>> ** Bug
>>   * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
>>   * [MESOS-5028] - Copy provisioner cannot replace directory with symlink.
>>   * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
>> Redirect 3xx urls.
>>   * [MESOS-6327] - Large docker images causes container launch failures:
>> Too many levels of symbolic links.
>>   * [MESOS-7057] - Consider using the relink functionality of libprocess
>> in
>> the executor driver.
>>   * [MESOS-7119] - Mesos master crash while accepting inverse offer.
>>   * [MESOS-7152] - The agent may be flapping after the machine reboots due
>> to provisioner recover.
>>   * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
>>   * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
>> --docker_mesos_image.
>>   * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
>> argument" error.
>>   * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
>> into sandbox logs.
>>   * [MESOS-7350] - Failed to pull image from Nexus Registry due to
>> signature missing.
>>   * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
>> persistent volume content.
>>   * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
>>   * [MESOS-7422] - Docker containerizer should not leak possibly sensitive
>> data to agent log.
>>
>> The CHANGELOG for the release is available at:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc1
>> 
>> 
>>
>> The candidate for Mesos 1.1.2 release is available at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz
>>
>> The tag to be voted on is 1.1.2-rc1:
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc1
>>
>> The MD5 checksum of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.md5
>>
>> The signature of the tarball can be found at:
>> https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is up in Maven in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1188
>>
>> Please vote on releasing this package as Apache Mesos 1.1.2!
>>
>> The vote is open until Tue May 9 12:12:12 CEST 2017 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.1.2
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Alex & Till
>>
>
>


[VOTE] Release Apache Mesos 1.1.2 (rc1)

2017-05-04 Thread Alex Rukletsov
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.1.2.

1.1.2 includes the following:

** Bug
  * [MESOS-2537] - AC_ARG_ENABLED checks are broken.
  * [MESOS-5028] - Copy provisioner cannot replace directory with symlink.
  * [MESOS-5172] - Registry puller cannot fetch blobs correctly from http
Redirect 3xx urls.
  * [MESOS-6327] - Large docker images causes container launch failures:
Too many levels of symbolic links.
  * [MESOS-7057] - Consider using the relink functionality of libprocess in
the executor driver.
  * [MESOS-7119] - Mesos master crash while accepting inverse offer.
  * [MESOS-7152] - The agent may be flapping after the machine reboots due
to provisioner recover.
  * [MESOS-7197] - Requesting tiny amount of CPU crashes master.
  * [MESOS-7210] - HTTP health check doesn't work when mesos runs with
--docker_mesos_image.
  * [MESOS-7237] - Enabling cgroups_limit_swap can lead to "invalid
argument" error.
  * [MESOS-7265] - Containerizer startup may cause sensitive data to leak
into sandbox logs.
  * [MESOS-7350] - Failed to pull image from Nexus Registry due to
signature missing.
  * [MESOS-7366] - Agent sandbox gc could accidentally delete the entire
persistent volume content.
  * [MESOS-7383] - Docker executor logs possibly sensitive parameters.
  * [MESOS-7422] - Docker containerizer should not leak possibly sensitive
data to agent log.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2-rc1


The candidate for Mesos 1.1.2 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz

The tag to be voted on is 1.1.2-rc1:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.2-rc1

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.2-rc1/mesos-1.1.2.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1188

Please vote on releasing this package as Apache Mesos 1.1.2!

The vote is open until Tue May 9 12:12:12 CEST 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.2
[ ] -1 Do not release this package because ...

Thanks,
Alex & Till


Re: Default executor grace period

2017-04-25 Thread Alex Rukletsov
Commented on the ticket.

On Tue, Jan 17, 2017 at 12:27 PM, Tomek Janiszewski 
wrote:

> Created issue for this: https://issues.apache.org/jira/browse/MESOS-6933
>
> On Mon, 16 Jan 2017 at 17:13, Tomek Janiszewski
> wrote:
>
>> It looks like it's supported, because the executor prints the grace
>> period [1]. On the other hand, the executor launches sh, which launches the
>> command, and the shell exits faster than the command after receiving
>> SIGTERM. This causes the command process to be reparented to init and
>> leaked. In my opinion, the default executor should not send SIGTERM to sh
>> but only to its children. This would allow proper escalation to SIGKILL,
>> because sh will live as long as its children are alive.
>>
>> 1: https://github.com/apache/mesos/blob/c4667d6f1b49d30089e6cb5874b6737a9bd3f044/src/launcher/executor.cpp#L479-L480
>>
>> On Mon, 16 Jan 2017 at 16:35, haosdent
>> wrote:
>>
>> It looks like the default executor does not yet handle
>> `--executor_shutdown_grace_period`.
>>
>> On Mon, Jan 16, 2017 at 7:41 PM, Tomek Janiszewski 
>> wrote:
>>
>> Hi
>>
>> I tried to use the grace period with the default Mesos executor. I assumed
>> it works as follows:
>>
>>1. Start the command: sh -c "command ..."
>>2. Send SIGSTOP to the process tree: sh, command
>>3. Send SIGTERM to the process tree: sh, command
>>4. Wait for the processes to finish or for the grace period to elapse
>>5. sh finishes while command could still be running and attached to init
>>6. Send SIGKILL to the process tree: command
>>
>> I noticed that SIGKILL is not sent and the executor finishes when sh
>> returns. When Mesos is running with the POSIX containerizer, this lets the
>> command live forever (if it ignores SIGTERM). When a containerizer is used,
>> the command is killed when its container is destroyed.
>>
>> Is this the desired behavior? How can the grace period be used with the
>> default executor?
>>
>> Thanks
>> Tomek
>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>>
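
To make the escalation discussed in this thread concrete (SIGTERM to the process tree, wait for the grace period, then SIGKILL whatever is left), here is a minimal Python sketch. It is not the Mesos executor's implementation: the helper names `stop_process_group` and `_group_alive` are hypothetical, and the sketch assumes a POSIX system and a command started in its own session so that the shell's pid doubles as the process group id.

```python
import os
import signal
import subprocess
import time


def _group_alive(pgid: int) -> bool:
    """Return True if at least one process still belongs to the group."""
    try:
        os.killpg(pgid, 0)  # signal 0 only checks for existence
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # something is there, we just may not signal it


def stop_process_group(proc: subprocess.Popen, grace_seconds: float) -> str:
    """SIGTERM the whole group, then SIGKILL whatever outlives the grace period.

    Assumption: `proc` was started with start_new_session=True, so its pid
    is also the id of the process group that contains its children.
    """
    pgid = proc.pid
    os.killpg(pgid, signal.SIGTERM)

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        proc.poll()  # reap the shell if it has already exited
        if not _group_alive(pgid):
            return "terminated"
        time.sleep(0.05)

    # The shell may be gone by now, but its children can still be in the group.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group vanished between the last check and the kill
    proc.wait()
    return "killed"
```

The point of signalling the group rather than only the shell is the one raised above: the loop keeps waiting while any member of the group is alive, so a command that outlives sh is still escalated to SIGKILL instead of being leaked.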


Re: mesos container cluster came across health check coredump log

2017-03-31 Thread Alex Rukletsov
Cool, looking forward to it!

On Fri, Mar 31, 2017 at 4:30 AM, tommy xiao <xia...@gmail.com> wrote:

> Alex, yes, let me give it a try.
>
> 2017-03-31 3:16 GMT+08:00 Alex Rukletsov <a...@mesosphere.com>:
>
>> This is https://issues.apache.org/jira/browse/MESOS-7210. Deshi, do you
>> want to send the patch? I or Haosdent can shepherd.
>>
>> A.
>>
>> On Thu, Mar 30, 2017 at 12:27 PM, tommy xiao <xia...@gmail.com> wrote:
>>
>>> interesting for the specified case.
>>>
>>> 2017-03-30 7:52 GMT+08:00 Jie Yu <yujie@gmail.com>:
>>>
>>>> + AlexR, haosdent
>>>>
>>>> For posterity, the root cause of this problem is that when agent is
>>>> running inside a docker container and `--docker_mesos_image` flag is
>>>> specified, the pid namespace of the executor container (which initiates the
>>>> health check) is different than the root pid namespace. Therefore, getting
>>>> the network namespace handle using `/proc//ns/net` does not work
>>>> because the 'pid' here is in the root pid namespace (reported by docker
>>>> daemon).
>>>>
>>>> Alex and haosdent, I think we should fix this issue. As suggested
>>>> above, we can launch the executor container with --pid=host if
>>>> `--docker_mesos_image` is specified.
>>>>
>>>> - Jie
>>>>
>>>> On Wed, Mar 29, 2017 at 3:56 AM, tommy xiao <xia...@gmail.com> wrote:
>>>>
>>>>> It was resolved by adding --pid=host. Thanks to the community for the
>>>>> support. Thanks a lot.
>>>>>
>>>>> 2017-03-29 9:52 GMT+08:00 tommy xiao <xia...@gmail.com>:
>>>>>
>>>>>> My environment is as follows:
>>>>>>
>>>>>> Mesos 1.2, running containerized in Docker.
>>>>>>
>>>>>> I launch a sample nginx Docker container with a Mesos native health check.
>>>>>>
>>>>>> Then I get a core dump in the sandbox.
>>>>>>
>>>>>> I have dug up some more information for your reference:
>>>>>>
>>>>>> In the Mesos slave container, I can only see the task container pid, but
>>>>>> I can't find the nginx process pid.
>>>>>>
>>>>>> In the host console, however, I can find the nginx pid. So how can I get
>>>>>> the pid in the container?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2017-03-28 13:49 GMT+08:00 tommy xiao <xia...@gmail.com>:
>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/MESOS-6184
>>>>>>>
>>>>>>> Can anyone give a hint?
>>>>>>>
>>>>>>> ```
>>>>>>>
>>>>>>> I0328 11:48:12.922181 48 exec.cpp:162] Version: 1.2.0
>>>>>>> I0328 11:48:12.929252 54 exec.cpp:237] Executor registered on agent
>>>>>>> a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4
>>>>>>> I0328 11:48:12.931640 54 docker.cpp:850] Running docker -H
>>>>>>> unix:///var/run/docker.sock run --cpu-shares 10 --memory 33554432
>>>>>>> --env-file /tmp/gvqGyb -v /data/mesos/slaves/a29dc3a5-3e
>>>>>>> 3f-4058-8ab4-dd7de2ae58d1-S4/frameworks/d7ef5d2b-f924-42d9-a
>>>>>>> 274-c020afba6bce-/executors/0-hc-xychu-datamanmesos-2f3b
>>>>>>> 47f9ffc048539c7b22baa6c32d8f/runs/458189b8-2ff4-4337-ad3a-67321e96f5cb:/mnt/mesos/sandbox
>>>>>>> --net bridge --label=USER_NAME=xychu --label=GROUP_NAME=groupautotest
>>>>>>> --label=APP_ID=hc --label=VCLUSTER=clusterautotest
>>>>>>> --label=USER=xychu --label=CLUSTER=datamanmesos --label=SLOT=0
>>>>>>> --label=APP=hc -p 31000:80/tcp --name mesos-a29dc3a5-3e3f-4058-8ab4-
>>>>>>> dd7de2ae58d1-S4.458189b8-2ff4-4337-ad3a-67321e96f5cb nginx
>>>>>>> I0328 11:48:16.145714 53 health_checker.cpp:196] Ignoring failure as
>>>>>>> health check still in grace period
>>>>>>> W0328 11:48:26.289958 49 health_checker.cpp:202] Health check failed
>>>>>>> 1 times consecutively: HTTP health check failed: curl returned 
>>>>>>> terminated
>>>>>>> with signal Aborted (core dumped): ABORT: (../../../3rdparty/libprocess/
>>>>>>> include/process/posix/subprocess.

Re: mesos container cluster came across health check coredump log

2017-03-30 Thread Alex Rukletsov
This is https://issues.apache.org/jira/browse/MESOS-7210. Deshi, do you
want to send the patch? Haosdent or I can shepherd.

A.
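
For readers new to the issue, the failure mode Jie describes in the quoted message below can be illustrated with a short Python sketch. It is only an illustration, not Mesos code: the helper `net_namespace_handle` and its `proc_root` parameter are hypothetical. On Linux, the network namespace of a process is exposed under its /proc entry, so a pid reported from the root pid namespace cannot be resolved from inside a container with its own pid namespace.

```python
import os


def net_namespace_handle(pid: int, proc_root: str = "/proc") -> str:
    """Return the path of the network namespace handle for `pid`.

    Raises LookupError when `pid` is not visible under `proc_root`, which is
    what happens when the pid was reported from a different pid namespace.
    """
    pid_dir = os.path.join(proc_root, str(pid))
    if not os.path.isdir(pid_dir):
        raise LookupError(f"Pid {pid} does not exist")
    return os.path.join(pid_dir, "ns", "net")
```

Launching the executor container with --pid=host, as suggested in the thread, makes the root namespace pids visible again, so a lookup like this would succeed.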

On Thu, Mar 30, 2017 at 12:27 PM, tommy xiao  wrote:

> interesting for the specified case.
>
> 2017-03-30 7:52 GMT+08:00 Jie Yu :
>
>> + AlexR, haosdent
>>
>> For posterity, the root cause of this problem is that when agent is
>> running inside a docker container and `--docker_mesos_image` flag is
>> specified, the pid namespace of the executor container (which initiates the
>> health check) is different than the root pid namespace. Therefore, getting
>> the network namespace handle using `/proc//ns/net` does not work
>> because the 'pid' here is in the root pid namespace (reported by docker
>> daemon).
>>
>> Alex and haosdent, I think we should fix this issue. As suggested above,
>> we can launch the executor container with --pid=host if
>> `--docker_mesos_image` is specified.
>>
>> - Jie
>>
>> On Wed, Mar 29, 2017 at 3:56 AM, tommy xiao  wrote:
>>
>>> It was resolved by adding --pid=host. Thanks to the community for the
>>> support. Thanks a lot.
>>>
>>> 2017-03-29 9:52 GMT+08:00 tommy xiao :
>>>
>>>> My environment is as follows:
>>>>
>>>> Mesos 1.2, running containerized in Docker.
>>>>
>>>> I launch a sample nginx Docker container with a Mesos native health check.
>>>>
>>>> Then I get a core dump in the sandbox.
>>>>
>>>> I have dug up some more information for your reference:
>>>>
>>>> In the Mesos slave container, I can only see the task container pid, but
>>>> I can't find the nginx process pid.
>>>>
>>>> In the host console, however, I can find the nginx pid. So how can I get
>>>> the pid in the container?
>>>>
>>>>
>>>>
>>>>
>>>> 2017-03-28 13:49 GMT+08:00 tommy xiao :
>>>>
> https://issues.apache.org/jira/browse/MESOS-6184
>
> Can anyone give a hint?
>
> ```
>
> I0328 11:48:12.922181 48 exec.cpp:162] Version: 1.2.0
> I0328 11:48:12.929252 54 exec.cpp:237] Executor registered on agent
> a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4
> I0328 11:48:12.931640 54 docker.cpp:850] Running docker -H
> unix:///var/run/docker.sock run --cpu-shares 10 --memory 33554432
> --env-file /tmp/gvqGyb -v /data/mesos/slaves/a29dc3a5-3e
> 3f-4058-8ab4-dd7de2ae58d1-S4/frameworks/d7ef5d2b-f924-42d9-a
> 274-c020afba6bce-/executors/0-hc-xychu-datamanmesos-2f3b
> 47f9ffc048539c7b22baa6c32d8f/runs/458189b8-2ff4-4337-ad3a-67321e96f5cb:/mnt/mesos/sandbox
> --net bridge --label=USER_NAME=xychu --label=GROUP_NAME=groupautotest
> --label=APP_ID=hc --label=VCLUSTER=clusterautotest --label=USER=xychu
> --label=CLUSTER=datamanmesos --label=SLOT=0 --label=APP=hc -p 31000:80/tcp
> --name 
> mesos-a29dc3a5-3e3f-4058-8ab4-dd7de2ae58d1-S4.458189b8-2ff4-4337-ad3a-67321e96f5cb
> nginx
> I0328 11:48:16.145714 53 health_checker.cpp:196] Ignoring failure as
> health check still in grace period
> W0328 11:48:26.289958 49 health_checker.cpp:202] Health check failed 1
> times consecutively: HTTP health check failed: curl returned terminated
> with signal Aborted (core dumped): ABORT: (../../../3rdparty/libprocess/
> include/process/posix/subprocess.hpp:190): Failed to execute
> Subprocess::ChildHook: Failed to enter the net namespace of pid 18596: Pid
> 18596 does not exist
>
> *** Aborted at 1490672906 (unix time) try "date -d @1490672906" if you are
> using GNU date ***
> PC: @ 0x7f26bfb485f7 __GI_raise
> *** SIGABRT (@0x4a) received by PID 74 (TID 0x7f26ba152700) from PID 74;
> stack trace: ***
> @ 0x7f26c0703100 (unknown)
> @ 0x7f26bfb485f7 __GI_raise
> @ 0x7f26bfb49ce8 __GI_abort
> @ 0x7f26c315778e _Abort()
> @ 0x7f26c31577cc _Abort()
> @ 0x7f26c237a4b6 process::internal::childMain()
> @ 0x7f26c2379e9c std::_Function_handler<>::_M_invoke()
> @ 0x7f26c2379e53 process::internal::defaultClone()
> @ 0x7f26c237b951 process::internal::cloneChild()
> @ 0x7f26c237954f process::subprocess()
> @ 0x7f26c15a9fb1 mesos::internal::checks::HealthCheckerProcess::httpHealthCheck()
> @ 0x7f26c15ababd mesos::internal::checks::HealthCheckerProcess::performSingleCheck()
> @ 0x7f26c2331389 process::ProcessManager::resume()
> @ 0x7f26c233a3f7 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f26c04a1220 (unknown)
> @ 0x7f26c06fbdc5 start_thread
> @ 0x7f26bfc0928d __clone
> W0328 11:48:36.340055 55 health_checker.cpp:202] Health check failed 2
> times consecutively: HTTP health check failed: curl returned terminated
> with

[RESULT][VOTE] Release Apache Mesos 1.1.1 (rc2)

2017-03-14 Thread Alex Rukletsov
 Hi folks,

The vote for Mesos 1.1.1 (rc2) has passed with the following votes.

+1 (Binding)
--
*** AlexR
*** Till Tönshoff
*** Vinod Kone

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.1

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.1

The mesos-1.1.1.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Alex & Till


Re: [VOTE] Release Apache Mesos 1.1.1 (rc2)

2017-03-14 Thread Alex Rukletsov
The vote has been open for more than two weeks now and there are no -1's.
I'll go ahead and vote myself:

+1 (binding)

Tested on internal CI with several known issues.

On Tue, Mar 7, 2017 at 6:08 PM, Till Toenshoff <toensh...@me.com> wrote:

> +1
>
> Tested on:
> - macOS 10.12.4 Beta (16E175b): ok
> - centos 6: mostly ok, MESOS-4736
> - centos 7: internal CI issues on capabilities tests, otherwise fine
> - debian 8: mostly ok, MESOS-7213
> - fedora 23: ok
> - ubuntu 12.04: mostly ok, MESOS-7218
> - ubuntu 14.04: mostly ok, MESOS-7218
> - ubuntu 16.04: mostly ok, MESOS-7218
>
>
> On Mar 4, 2017, at 1:09 AM, Vinod Kone <vinodk...@apache.org> wrote:
>
> +1 (binding)
>
> Since the perf issue I reported earlier doesn't seem to be a blocker.
>
> On Fri, Mar 3, 2017 at 12:14 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> Was this perf issue introduced by one of the fixes included in 1.1.1-rc2?
>> If not, I would suggest we vote for 1.1.1-rc2 and back port the perf fix
>> into 1.1.2. IIUC, time based patch releases should *not be worse*, hence
>> if
>> the perf issue was already in 1.1.0 it is *fine* to fix it in 1.1.2. I
>> would like to avoid postponing already belated 1.1.1 for even longer.
>>
>> On Wed, Mar 1, 2017 at 8:02 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>> > Tested on ASF CI.
>> >
>> > Saw 2 configurations fail with
>> > https://issues.apache.org/jira/browse/MESOS-7160
>> >
>> > I think @jpeach and @bbannier were looking into this. Not sure about the
>> > severity of the issue, so withholding my vote.
>> >
>> >
>> > *Revision*: b9d8202a7444d0d1e49476bfc9817eb4583beaff
>> >
>> >- refs/tags/1.1.1-rc2
>> >
>> > Configuration Matrix gcc clang
>> > centos:7 --verbose --enable-libevent --enable-ssl autotools
>> > [image: Success]
>> > <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
>> > Release/30/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--
>> verbose%20--
>> > enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
>> > 20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%
>> > 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> > [image: Not run]
>> > cmake
>> > [image: Success]
>> > <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
>> > Release/30/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
>> > verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
>> > GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%
>> > 7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> > [image: Not run]
>> > --verbose autotools
>> > [image: Success]
>> > <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
>> > Release/30/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,
>> > ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_
>> > exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> > [image: Not run]
>> > cmake
>> > [image: Success]
>> > <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
>> > Release/30/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
>> > verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%
>> > 3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> > [image: Not run]
>> > ubuntu:14.04 --verbose --enable-libevent --enable-ssl autotools
>> > [image: Success]
>> > <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
>> > Release/30/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--
>> verbose%20--
>> > enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
>> > 20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%
>> > 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> > [image: Failed]
>> > <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
>> > Release/30/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=
>> --verbose%20--
>> > enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
>> > 20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%
>> > 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
>> > cmake
>> > [image: Success]
>> > <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
>> > Release/30/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
>> > verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
>> > GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_e

Re: [VOTE] Release Apache Mesos 1.1.1 (rc2)

2017-03-03 Thread Alex Rukletsov
Was this perf issue introduced by one of the fixes included in 1.1.1-rc2?
If not, I would suggest we vote for 1.1.1-rc2 and backport the perf fix
into 1.1.2. IIUC, time-based patch releases should *not be worse*, hence if
the perf issue was already in 1.1.0 it is *fine* to fix it in 1.1.2. I
would like to avoid postponing the already belated 1.1.1 even longer.

On Wed, Mar 1, 2017 at 8:02 PM, Vinod Kone <vinodk...@apache.org> wrote:

> Tested on ASF CI.
>
> Saw 2 configurations fail with
> https://issues.apache.org/jira/browse/MESOS-7160
>
> I think @jpeach and @bbannier were looking into this. Not sure about the
> severity of the issue, so withholding my vote.
>
>
> *Revision*: b9d8202a7444d0d1e49476bfc9817eb4583beaff
>
>- refs/tags/1.1.1-rc2
>
> Configuration Matrix gcc clang
> centos:7 --verbose --enable-libevent --enable-ssl autotools
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%
> 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Not run]
> cmake
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
> GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%
> 7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Not run]
> --verbose autotools
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,
> ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_
> exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Not run]
> cmake
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%
> 3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Not run]
> ubuntu:14.04 --verbose --enable-libevent --enable-ssl autotools
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%
> 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Failed]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--
> enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%
> 20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%
> 7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> cmake
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
> GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(
> docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=cmake,COMPILER=clang,CONFIGURATION=-
> -verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=
> GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(
> docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> --verbose autotools
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,
> ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,
> label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Failed]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose,
> ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,
> label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> cmake
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--
> verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%
> 3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/>
> [image: Success]
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-
> Release/30/BUILDTOOL=cmake,COMPILER=clang,CONFIGURATION=-
> -verbose,ENVIRONMENT=GLOG_v=1%20MESO

[VOTE] Release Apache Mesos 1.1.1 (rc2)

2017-02-27 Thread Alex Rukletsov
 Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.1.1.

1.1.1 includes the following:

** Bug
  * [MESOS-6002] - The whiteout file cannot be removed correctly using aufs
backend.
  * [MESOS-6010] - Docker registry puller shows decode error "No response
decoded".
  * [MESOS-6142] - Frameworks may RESERVE for an arbitrary role.
  * [MESOS-6360] - The handling of whiteout files in provisioner is not
correct.
  * [MESOS-6411] - Add documentation for CNI port-mapper plugin.
  * [MESOS-6526] - `mesos-containerizer launch --environment` exposes
executor env vars in `ps`.
  * [MESOS-6571] - Add "--task" flag to mesos-execute.
  * [MESOS-6597] - Include v1 Operator API protos in generated JAR and
python packages.
  * [MESOS-6606] - Reject optimized builds with libcxx before 3.9.
  * [MESOS-6621] - SSL downgrade path will CHECK-fail when using both
temporary and persistent sockets.
  * [MESOS-6624] - Master WebUI does not work on Firefox 45.
  * [MESOS-6676] - Always re-link with scheduler during re-registration.
  * [MESOS-6848] - The default executor does not exit if a single task pod
fails.
  * [MESOS-6852] - Nested container's launch command is not set correctly
in docker/runtime isolator.
  * [MESOS-6917] - Segfault when the executor sets an invalid UUID when
sending a status update.
  * [MESOS-7008] - Quota not recovered from registry in empty cluster.
  * [MESOS-7133] - mesos-fetcher fails with openssl-related output.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.1-rc2


The candidate for Mesos 1.1.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.1-rc2/mesos-1.1.1.tar.gz

The tag to be voted on is 1.1.1-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.1-rc2

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.1-rc2/mesos-1.1.1.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.1-rc2/mesos-1.1.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS
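
For anyone checking the candidate, the MD5 comparison above can be scripted. This is only a sketch, not part of the release tooling: the helper names are made up here, and it assumes the published `.md5` file contains the digest as 32 hex digits, possibly upper-case and split by whitespace. Verifying the `.asc` signature against the KEYS file needs a PGP tool and is not shown.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, published_text):
    """Check a local file against the text of a published .md5 file.

    Assumption: the digest appears in the text as 32 hex digits, possibly
    upper-case and separated by whitespace, optionally after a file name.
    """
    normalized = "".join(published_text.split()).lower()
    return md5_hex(path) in normalized
```

Usage would be along the lines of `matches_published_md5("mesos-1.1.1.tar.gz", open("mesos-1.1.1.tar.gz.md5").read())` after downloading both files.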

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1182

Please vote on releasing this package as Apache Mesos 1.1.1!

The vote is open until Thu Mar  2 23:59:59 CET 2017 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.1.1
[ ] -1 Do not release this package because ...

Thanks,
Till & Alex


Re: customized IP for health check

2017-01-18 Thread Alex Rukletsov
I'm not sure that exposing a domain will help: do you know the IP of your
task upfront, i.e., at the moment when you construct TaskInfo? Isn't your
task listening on all interfaces?
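
If the IP is in fact known when the TaskInfo is built, a COMMAND health check can target it directly, which is the workaround mentioned in the quoted message below. A minimal sketch, assuming the check is written as the JSON form of the `HealthCheck` message; the helper name, the `/health` path, the port, and the availability of `curl` in the task's environment are all illustrative assumptions.

```python
import json


def command_health_check(ip, port, path="/health"):
    """Build a COMMAND health check that probes a task-specific IP via curl.

    The field names mirror the HealthCheck protobuf; the timing values are
    arbitrary examples, not recommendations.
    """
    return {
        "type": "COMMAND",
        "command": {"value": "curl -sf http://{}:{}{}".format(ip, port, path)},
        "interval_seconds": 10.0,
        "timeout_seconds": 5.0,
        "consecutive_failures": 3,
    }


if __name__ == "__main__":
    # 10.0.0.5 stands in for the address assigned to the task.
    print(json.dumps(command_health_check("10.0.0.5", 8080), indent=2))
```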

On Wed, Jan 18, 2017 at 9:54 AM, CmingXu  wrote:

> The network I currently use is USER, and each task is assigned
> a unique VLAN IP, with Macvlan as the underlying Docker driver. I
> want my framework users to be able to define their own
> HealthChecks with the IP assigned to a specific task.
>
> I walked through the Mesos source code, and the TCP and HTTP checks
> don't meet my requirements because DEFAULT_DOMAIN is hard-coded. For now the
> only option seems to be a COMMAND health check, but it would be a great
> help if TCP checks supported passing an IP.
>
> Thanks
>
> On Wed, Jan 18, 2017 at 4:40 PM, Jie Yu  wrote:
> > Hi, can you elaborate a bit more on why you need to use a customized IP,
> > rather than using localhost for the health check?
> >
> > - Jie
> >
> > On Wed, Jan 18, 2017 at 9:19 AM, CmingXu  wrote:
> >>
> >> Is there any plan to support a customized IP when defining a health check?
> >> If so, what's the ETA?
> >>
> >> thanks
> >
> >
>


[Design Doc] Arbitrary task checks in Mesos

2017-01-05 Thread Alex Rukletsov
We've recently been working on a design for arbitrary task checks [1] in
Mesos (currently called probes, but this will likely change). Please have a
look and leave comments on the doc or start high-level discussion on this
thread.

Alex.

[1]
https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU


Mesos 1.1.1 release dashboard

2016-12-22 Thread Alex Rukletsov
Folks,

We are planning to cut the 1.1.1 release early next week. If you have any
patches that need to get into 1.1.1, please make sure that either it is
already in the 1.1.x branch or the corresponding ticket has a target
version including 1.1.1 *by Monday* Dec 26.

The release dashboard:
https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12329892

AlexR & Till.


Re: Mesos on AWS

2016-12-21 Thread Alex Rukletsov
Kiril—

from what you described, it does not sound like the Linux distribution is
the problem. It may be your AWS configuration. However, if a combination
of health checks and a heavily loaded agent leads to agent termination, I
would like to investigate this issue. Please come back (with logs!) if you
see the issue again.

On Tue, Dec 20, 2016 at 3:46 PM, Kiril Menshikov 
wrote:

> Hey,
>
> Sorry for the delayed response. I reinstalled my AWS infrastructure. Now I
> install everything on Red Hat Linux. Before, I used Amazon Linux.
>
> I tested with a single master (m4.large). Everything works perfectly. I am not
> sure whether the problem was Amazon Linux or my old configuration.
>
> Thanks,
> -Kirils
>
> On 18 December 2016 at 14:03, Guillermo Rodriguez 
> wrote:
>
>> Hi,
>> I run my Mesos cluster in AWS, between 40 and 100 m4.2xlarge instances at
>> any time. Between 200 and 1500 jobs at any time. Slaves run as spot instances.
>>
>> So, the only moment I get a TASK_LOST is when I lose a spot instance due
>> to being outbid.
>>
>> I guess you may also lose instances due to an AWS autoscaler scale-in
>> procedure. For example, if it decides the cluster is underutilised, it
>> can kill any instance in your cluster, not necessarily the least used one.
>> That's the reason we decided to develop our customised autoscaler that
>> detects and kills specific instances based on our own rules.
>>
>> So, are you using spot fleets or spot instances? Have you set up your
>> scale-in procedures correctly?
>>
>> Also, if you are running fine grained tiny jobs (400 jobs in a 10xlarge
>> means 0.1 CPUs and 400MB RAM each), I recommend you avoid an m4.10xlarge
>> instance and run xlarge instances instead. Same price and if you lose one
>> you just lose 1/10th of your jobs.
>>
>> Luck!
>>
>>
>>
>>
>>
>> --
>> *From*: "haosdent" 
>> *Sent*: Saturday, December 17, 2016 6:12 PM
>> *To*: "user" 
>> *Subject*: Re: Mesos on AWS
>>
>> >  sometimes Mesos agent is launched but master doesn’t show them.
>> It sounds like the Master could not connect to your Agents. Would you
>> mind pasting your Mesos Master log? Does anything in it show that the
>> Mesos agents are disconnected?
>>
>> On Sat, Dec 17, 2016 at 4:08 AM, Kiril Menshikov 
>> wrote:
>>>
>>> I have my own framework. Sometimes I get TASK_LOST status with message
>>> slave lost during health check.
>>>
>>> Also, I found that sometimes a Mesos agent is launched but the master
>>> doesn't show it. From the agent I can see that it found the master and
>>> connected. After an agent restart it starts working.
>>>
>>> -Kiril
>>>
>>>
>>>
>>> On Dec 16, 2016, at 21:58, Zameer Manji  wrote:
>>>
>>> Hey,
>>>
>>> Could you detail on what you mean by "delays and health check problems"?
>>> Are you using your own framework or an existing one? How are you launching
>>> the tasks?
>>>
>>> Could you share logs from Mesos that show timeouts to ZK?
>>>
>>> For reference, I operate a large Mesos cluster and I have never
>>> encountered problems when running 1k tasks concurrently so I think sharing
>>> data would help everyone debug this problem.
>>>
>>> On Fri, Dec 16, 2016 at 6:05 AM, Kiril Menshikov 
>>> wrote:

 Hi,

 Has anybody tried running Mesos on AWS instances? Can you give me
 recommendations?

 I am developing an elastic Mesos cluster (scaling AWS instances on demand).
 Currently I have 3 master instances. I run about 1000 tasks simultaneously.
 I see delays and health check problems.

 ~400 tasks fit in one m4.10xlarge instance (160GB RAM, 40 CPU).

 At the moment I have increased the timeout in the ZooKeeper cluster. What
 can I do to decrease timeouts?

 Also, how can I increase performance? The main bottleneck is that I have
 a large number of tasks running simultaneously for an hour, after which I
 shut them down or restart them (depending on how well they perform).

 -Kiril

 --
 Zameer Manji

>>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Thanks,
> -Kiril
> Phone +37126409291 <+371%2026%20409%20291>
> Riga, Latvia
> Skype perimetr122
>


Re: Proposal: mesosadm, the command to bootstrap the mesos cluster.

2016-12-14 Thread Alex Rukletsov
I have a different opinion on this. Several years ago I came across the
concept of "mean wizards": helpers that hide important steps from
the user and hence do not give them an opportunity to learn how things
actually work. (If you're interested, it was about projects in Borland IDEs
that gave you an editor with a "play" button in several clicks,
hiding all intermediate steps.)

Mesos is not the simplest piece of software. I want its users to
understand what quorum is. I want them to understand how and why to use rate
limiting. I want them to spend time figuring out why pointing work_dir to
/tmp is probably not the best idea.

Instead of giving them a false feeling that running Mesos in production is
as easy as running `cat` or `nano`, I'd rather focus on helping them
learn Mesos: better docs, tutorials, flag descriptions and so on.

On Wed, Dec 14, 2016 at 5:23 PM, tommy xiao  wrote:

> feel it.
>
> $ kubeadm reset
> $ service kubelet start
> $ kubeadm init --use-kubernetes-version=v1.5.1
>
> for the freshest of kubes
>
>
> 2016-12-14 12:48 GMT+08:00 tommy xiao :
>
>> yeah.
>>
>> 2016-12-14 12:22 GMT+08:00 haosdent :
>>
>>> We had a discussion in the China User Group before.
>>> Jay Guo mentioned that a better way may be to remove ZooKeeper
>>> and use the replicated log for leader election.
>>> New users would then just need to start masters and agents in
>>> production, without ZooKeeper or etcd.
>>> The only necessary configuration item would be the master address list,
>>> which would greatly reduce the overhead of getting started with Mesos.
>>>
>>> On Tue, Dec 13, 2016 at 4:20 PM, Stephen Gran 
>>> wrote:
>>>
 Hi,

 I'm quite happy with the current approach of bootstrapping a new agent
 with the location of zookeeper and a set of credentials.  This allows
 our automation code to make new agents join the cluster automatically.

 Not that I'm opposed to the two step process you propose, I'm sure we
 can make that happen automatically as well, but aside from making mesos
 look more like other solutions, does it bring semantics that would be
 useful?  ie, are there actions that 'mesosadm init' would initiate?  Or
 would this be purely an interactive way to do the same things you can do
 now by seeding out config files?

 Cheers,

 On 13/12/16 05:14, tommy xiao wrote:
 > Hi team,
 >
 >
 > I come from the China Mesos community. In today's group discussion, we
 > came across a topic: how to enhance users' cluster experience?
 >
 > Because newcomers are a top resource for a community, if we can enhance
 > the current Mesos cluster installation steps, it will help us quickly
 > bootstrap the user community.
 >
 > why mesosadm?
 >
 > For example, the Swarm cluster setup steps:
 >
 > 1. docker swarm init
 > 2. docker swarm join
 >
 > And the Kubernetes 1.5 cluster setup steps:
 >
 > 1. kubeadm init
 > 2. kubeadm join --token  
 >
 > So I think the init/join style is a good experience for normal users.
 > What do you think?
 >
 >
 >
 > --
 > Deshi Xiao
 > Twitter: xds2000
 > E-mail: xiaods(AT)gmail.com 

 --
 Stephen Gran
 Senior Technical Architect

 picture the possibilities | piksel.com

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
>>
>> --
>> Deshi Xiao
>> Twitter: xds2000
>> E-mail: xiaods(AT)gmail.com
>>
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: Command healthcheck failed but status KILLED

2016-12-12 Thread Alex Rukletsov
Technically the task has not failed but was killed by the executor
(because it failed a health check).

On Fri, Dec 9, 2016 at 11:27 AM, Tomek Janiszewski 
wrote:

> Hi
>
> What is the desired behavior when a command health check fails? On Mesos
> 1.0.2, when a health check fails, the task has state KILLED instead of
> FAILED, with a reason specifying that it was killed due to a failing
> health check.
>
> Thanks
> Tomek
>


Re: Duplicate task IDs

2016-12-11 Thread Alex Rukletsov
I'm fine with prohibiting non-unique IDs, but why do you plan to keep the
most recent in case of a conflict? I'd expect any duplicate (that we can
find out) is rejected / killed / banned / unchurched.
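
To make the difference concrete, here is a small sketch of the behaviors under discussion: a list keeps duplicates, a map keyed by task ID keeps only the latest entry, and a stricter variant rejects duplicates. It is plain Python rather than master code, and the task IDs and states are made up for illustration.

```python
def as_list(completed):
    """Behavior as described for the list: every completed task is kept."""
    return list(completed)


def as_map_keep_latest(completed):
    """Proposed behavior: a later task with the same ID overwrites the earlier one."""
    tasks = {}
    for task_id, state in completed:
        tasks[task_id] = state
    return tasks


def as_map_reject_duplicates(completed):
    """Stricter alternative: keep the first entry and report later duplicates."""
    tasks, rejected = {}, []
    for task_id, state in completed:
        if task_id in tasks:
            rejected.append((task_id, state))
        else:
            tasks[task_id] = state
    return tasks, rejected


# Hypothetical history in completion order; "task-1" is reused.
completed = [
    ("task-1", "TASK_FINISHED"),
    ("task-2", "TASK_FAILED"),
    ("task-1", "TASK_KILLED"),
]
```

With this input, the list holds three entries, the keep-latest map holds two (with "task-1" mapped to its most recent state), and the rejecting variant reports the second "task-1" as a duplicate.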

On 9 Dec 2016 8:13 pm, "Joris Van Remoortere"  wrote:

> Hey Neil,
>
> I concur that using duplicate task IDs is bad practice and asking for
> trouble.
>
> Could you please clarify *why* you want to use a hashmap? Is your goal to
> remove duplicate task IDs or is this just a side-effect and you have a
> different reason (e.g. performance) for using a hashmap?
>
> I'm wondering why a multi-hashmap is not sufficient. This would be clear if
> you were explicitly *trying* to get rid of duplicates of course :-)
>
> Thanks,
> Joris
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Dec 9, 2016 at 7:08 AM, Neil Conway  wrote:
>
> > Folks,
> >
> > The master stores a cache of metadata about recently completed tasks;
> > for example, this information can be accessed via the "/tasks" HTTP
> > endpoint or the "GET_TASKS" call in the new Operator API.
> >
> > The master currently stores this metadata using a list; this means
> > that duplicate task IDs are permitted. We're considering [1] changing
> > this to use a hashmap instead. Using a hashmap would mean that
> > duplicate task IDs would be discarded: if two completed tasks have the
> > same task ID, only the metadata for the most recently completed task
> > would be retained by the master.
> >
> > If this behavior change would cause problems for your framework or
> > other software that relies on Mesos, please let me know.
> >
> > (Note that if you do have two completed tasks with the same ID, you'd
> > need an unambiguous way to tell them apart. As a recommendation, I
> > would strongly encourage framework authors to never reuse task IDs.)
> >
> > Neil
> >
> > [1] https://reviews.apache.org/r/54179/
> >
>


Re: Quota

2016-12-11 Thread Alex Rukletsov
The allocation granularity in the allocator is a single agent. Hence, even
though you set quota for 0.0001 CPU, at least one agent is "blocked". This
is probably the reason why Marathon is not getting offers. You can turn on
verbose master logging and check allocator messages to confirm.
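
A toy model of the effect, to show why a tiny quota can starve other roles on a single-node cluster. This is a deliberately simplified sketch and not the allocator's actual logic: the function name and the "set aside whole agents until the guarantee is covered" rule are assumptions made only for illustration.

```python
def split_agents(agents, quota_cpus):
    """Toy model of agent-granular quota handling (not the real allocator).

    Whole agents are set aside for the quota'd role until their combined
    CPUs cover the guarantee; only the remaining agents are available to
    other roles.
    """
    set_aside, remaining = [], []
    covered = 0.0
    for agent in agents:
        if covered < quota_cpus:
            set_aside.append(agent)
            covered += agent["cpus"]
        else:
            remaining.append(agent)
    return set_aside, remaining


# The single 4-CPU agent from the quoted cluster state.
single_node = [{"id": "S0", "cpus": 4.0}]
```

In this model `split_agents(single_node, 0.0001)` leaves nothing for other roles, while a quota of 0 sets nothing aside, which matches the observation in the quoted report that a zero quota let Marathon receive offers.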

Alex.

On 10 Dec 2016 2:14 am, "Vijay"  wrote:

> The dispatcher needs 1cpu and 1G memory.
>
> Regards,
> Vijay
>
> Sent from my iPhone
>
> > On Dec 9, 2016, at 4:51 PM, Vinod Kone  wrote:
> >
> > And how many resources does spark need?
> >
> >> On Fri, Dec 9, 2016 at 4:05 PM, Vijay Srinivasaraghavan <
> vijikar...@yahoo.com> wrote:
> >> Here is the slave state info. I see marathon is registered as
> "slave_public" role and is configured with "default_accepted_resource_roles"
> as "*"
> >>
> >> "slaves":[
> >>   {
> >>     "id":"69356344-e2c4-453d-baaf-22df4a4cc430-S0",
> >>     "pid":"slave(1)@xxx.xxx.xxx.100:5051",
> >>     "hostname":"xxx.xxx.xxx.100",
> >>     "registered_time":1481267726.19244,
> >>     "resources":{
> >>       "disk":12099.0,
> >>       "mem":14863.0,
> >>       "gpus":0.0,
> >>       "cpus":4.0,
> >>       "ports":"[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
> >>     },
> >>     "used_resources":{
> >>       "disk":0.0,
> >>       "mem":0.0,
> >>       "gpus":0.0,
> >>       "cpus":0.0
> >>     },
> >>     "offered_resources":{
> >>       "disk":0.0,
> >>       "mem":0.0,
> >>       "gpus":0.0,
> >>       "cpus":0.0
> >>     },
> >>     "reserved_resources":{},
> >>     "unreserved_resources":{
> >>       "disk":12099.0,
> >>       "mem":14863.0,
> >>       "gpus":0.0,
> >>       "cpus":4.0,
> >>       "ports":"[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
> >>     },
> >>     "attributes":{},
> >>     "active":true,
> >>     "version":"1.0.1"
> >>   }
> >> ],
> >>
> >> Regards
> >> Vijay
> >> On Friday, December 9, 2016 3:48 PM, Vinod Kone 
> wrote:
> >>
> >>
> >> How many resources does the agent register with the master? How many
> resources does spark task need?
> >>
> >> I'm guessing marathon is not registered with "test" role so it is only
> getting un-reserved resources which are not enough for spark task?
> >>
> >> On Fri, Dec 9, 2016 at 2:54 PM, Vijay Srinivasaraghavan <
> vijikar...@yahoo.com> wrote:
> >> I have a standalone DCOS setup (Single node Vagrant VM running DCOS
> v.1.9-dev build + Mesos 1.0.1 + Marathon 1.3.0). Both master and agent are
> running on same VM.
> >>
> >> Resource: 4 CPU, 16GB Memory, 20G Disk
> >>
> >> I have created a quota using new V1 API which creates a role "test"
> with resource constraints of 0.5 CPU and 1G Memory.
> >>
> >> When I try to deploy Spark package, Marathon receives the request but
> the task is in "waiting" state since it did not receive any offers from
> Master though I don't see any resource constraints from the hardware
> perspective.
> >>
> >> However, when I deleted the quota, Marathon is able to move forward
> with the deployment and Spark was deployed/up and running. I could see from
> the Mesos master logs that it had sent an offer to the Marathon framework.
> >>
> >> To debug the issue, I was trying to create a quota but this time did
> not provide any CPU and Memory (0 cpu and 0 mem). After this, when I try to
> deploy Spark from DCOS UI, I could see Marathon getting offer from Master
> and able to deploy Spark without the need to delete the quota this time.
> >>
> >> Did anyone notice similar behavior?
> >>
> >> Regards
> >> Vijay
> >>
> >>
> >>
> >
>


Re: healthcheck task?

2016-12-07 Thread Alex Rukletsov
We currently do not provide necessary primitives to run sidecar tasks
(which is necessary to ensure your "health check" task is run alongside
your "payload" task). However, as you have mentioned yourself, some
frameworks do provide health check functionality, e.g., Marathon.

In Mesos 1.2 (already available in the master branch) we will
introduce Mesos-native health checks, which every framework that uses
a built-in executor can leverage. The doc is here [1]. There are no examples
in Java, but I assume the C++ code snippets can be translated directly into
Java, since they just populate protobuf messages.

[1] https://github.com/apache/mesos/blob/master/docs/health-checks.md
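
To illustrate what "just populating protobuf messages" amounts to, here is a minimal sketch that builds an HTTP health check and attaches it to a task, using plain dictionaries in the JSON shape of the `HealthCheck` and `TaskInfo` messages. The helper names, the port, the path, and the timing values are illustrative assumptions; a Java framework would set the same fields through the generated protobuf builders.

```python
def http_health_check(port, path="/health"):
    """Build an HTTP health check; field names mirror the HealthCheck protobuf."""
    return {
        "type": "HTTP",
        "http": {"port": port, "path": path},
        "delay_seconds": 0.0,
        "interval_seconds": 5.0,
        "timeout_seconds": 2.0,
        "consecutive_failures": 3,
        "grace_period_seconds": 15.0,
    }


def with_health_check(task_info, health_check):
    """Return a copy of a TaskInfo-like dict with its `health_check` field set."""
    updated = dict(task_info)
    updated["health_check"] = health_check
    return updated


# Hypothetical task; a real TaskInfo also needs an agent ID, resources, etc.
task = {"name": "web", "task_id": {"value": "web-1"}}
checked_task = with_health_check(task, http_health_check(8080))
```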

On Wed, Dec 7, 2016 at 10:19 PM, Victor L <vlyamt...@gmail.com> wrote:

> I found javadocs for package "Protos.HealthCheck":
> http://mesos.apache.org/api/latest/java/org/apache/mesos/
> Protos.HealthCheck.html
>  but not a single example of how to use it
>
> On Wed, Dec 7, 2016 at 11:22 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> What exactly do you mean by "health check task"?
>>
>> On Wed, Dec 7, 2016 at 5:09 PM, Victor L <vlyamt...@gmail.com> wrote:
>>
>>> Can someone recommend simple example of how to add healthcheck task to
>>> java framework?
>>> Thanks,
>>>
>>>
>>
>


Re: healthcheck task?

2016-12-07 Thread Alex Rukletsov
What exactly do you mean by "health check task"?

On Wed, Dec 7, 2016 at 5:09 PM, Victor L  wrote:

> Can someone recommend simple example of how to add healthcheck task to
> java framework?
> Thanks,
>
>


Re: [VOTE] Release Apache Mesos 0.28.3 (rc1)

2016-11-28 Thread Alex Rukletsov
I see LinuxFilesystemIsolatorTest.ROOT_ChangeRootFilesystem failing on
CentOS 7 and Fedora 23; see, e.g., [1]. I don't see any backports touching
[2]. Could it be a regression, or is this test known to be problematic in
0.28.x?

[1] http://pastebin.com/c5PzfGF8
[2]
https://github.com/apache/mesos/blob/0.28.x/src/tests/containerizer/filesystem_isolator_tests.cpp

On Thu, Nov 24, 2016 at 12:07 AM, Anand Mazumdar  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 0.28.3.
>
>
> 0.28.3 includes the following:
> 
> 
>
> ** Bug
>   * [MESOS-2043] - Framework auth fail with timeout error and never
> get authenticated
>   * [MESOS-4638] - Versioning preprocessor macros.
>   * [MESOS-5073] - Mesos allocator leaks role sorter and quota role
> sorters.
>   * [MESOS-5330] - Agent should backoff before connecting to the master.
>   * [MESOS-5390] - v1 Executor Protos not included in maven jar
>   * [MESOS-5543] - /dev/fd is missing in the Mesos containerizer
> environment.
>   * [MESOS-5571] - Scheduler JNI throws exception when the major
> versions of JAR and libmesos don't match.
>   * [MESOS-5576] - Masters may drop the first message they send
> between masters after a network partition.
>   * [MESOS-5673] - Port mapping isolator may cause segfault if it bind
> mount root does not exist.
>   * [MESOS-5691] - SSL downgrade support will leak sockets in CLOSE_WAIT
> status.
>   * [MESOS-5698] - Quota sorter not updated for resource changes at agent.
>   * [MESOS-5723] - SSL-enabled libprocess will leak incoming links to
> forks.
>   * [MESOS-5740] - Consider adding `relink` functionality to libprocess.
>   * [MESOS-5748] - Potential segfault in `link` when linking to a
> remote process.
>   * [MESOS-5763] - Task stuck in fetching is not cleaned up after
> --executor_registration_timeout.
>   * [MESOS-5913] - Stale socket FD usage when using libevent + SSL.
>   * [MESOS-5927] - Unable to run "scratch" Dockerfiles with Unified
> Containerizer.
>   * [MESOS-5943] - Incremental http parsing of URLs leads to decoder error.
>   * [MESOS-5986] - SSL Socket CHECK can fail after socket receives EOF.
>   * [MESOS-6104] - Potential FD double close in libevent's
> implementation of `sendfile`.
>   * [MESOS-6142] - Frameworks may RESERVE for an arbitrary role.
>   * [MESOS-6152] - Resource leak in libevent_ssl_socket.cpp.
>   * [MESOS-6233] - Master CHECK fails during recovery while relinking
> to other masters.
>   * [MESOS-6234] - Potential socket leak during Zookeeper network changes.
>   * [MESOS-6246] - Libprocess links will not generate an ExitedEvent
> if the socket creation fails.
>   * [MESOS-6299] - Master doesn't remove task from pending when it is
> invalid.
>   * [MESOS-6457] - Tasks shouldn't transition from TASK_KILLING to
> TASK_RUNNING.
>   * [MESOS-6502] - _version uses incorrect
> MESOS_{MAJOR,MINOR,PATCH}_VERSION in libmesos java binding.
>   * [MESOS-6527] - Memory leak in the libprocess request decoder.
>   * [MESOS-6621] - SSL downgrade path will CHECK-fail when using both
> temporary and persistent sockets
>
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_
> plain;f=CHANGELOG;hb=0.28.3-rc1
> 
> 
>
> The candidate for Mesos 0.28.3 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/0.28.3-rc1/
> mesos-0.28.3.tar.gz
>
> The tag to be voted on is 0.28.3-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.28.3-rc1
>
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/0.28.3-rc1/
> mesos-0.28.3.tar.gz.md5
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/0.28.3-rc1/
> mesos-0.28.3.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1170
>
> Please vote on releasing this package as Apache Mesos 0.28.3!
>
> The vote is open until Sat Nov 26 14:59:10 PST 2016 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 0.28.3
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Anand & Joseph
>


On increasing visibility into experimental features.

2016-11-01 Thread Alex Rukletsov
Folks,

In addition to the "known bugs" proposal in a parallel thread, we think
that maintaining a list of still-experimental features for each minor
release will significantly help users adjust their expectations.

Our suggestion is to include a new section into the CHANGELOG called
"Experimental Features" starting with the upcoming 1.1.0 release.
Populating this section should be relatively easy: take the contents of
this section from the previous minor release, remove features declared
stable, and add new experimental features.
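
The update rule above is a simple set operation. A sketch, with a made-up helper name and placeholder feature names:

```python
def next_experimental(previous, declared_stable, newly_added):
    """Compute the "Experimental Features" list for the next minor release.

    Start from the previous release's list, drop features that have been
    declared stable, and add the features introduced as experimental.
    """
    return sorted((set(previous) - set(declared_stable)) | set(newly_added))


# Placeholder names, not actual Mesos features.
print(next_experimental(
    previous=["feature-a", "feature-b"],
    declared_stable=["feature-a"],
    newly_added=["feature-c"],
))
```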

With this change users will have a complete overview of experimental
functionality per release, without searching the CHANGELOG for when and
whether a certain feature became production-ready.

What do you think?

AlexR.


Transition TASK_KILLING -> TASK_RUNNING

2016-10-31 Thread Alex Rukletsov
We've recently discovered a bug that may lead to a task being transitioned
from killing to running state. More information about it in MESOS-6457 [1].
We plan to fix it in 1.2.0 and will backport it to all supported versions.

[1] https://issues.apache.org/jira/browse/MESOS-6457
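
Until the fix lands, a framework that records status updates could flag the symptom on its side. The following is only a sketch of such a check: the function name is made up, and it assumes the framework keeps the sequence of states it has observed for each task.

```python
def unexpected_transitions(states):
    """Return the indices at which TASK_RUNNING directly follows TASK_KILLING."""
    return [
        i
        for i in range(1, len(states))
        if states[i - 1] == "TASK_KILLING" and states[i] == "TASK_RUNNING"
    ]


# Hypothetical update history for one task.
history = ["TASK_RUNNING", "TASK_KILLING", "TASK_RUNNING", "TASK_KILLED"]
print(unexpected_transitions(history))
```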


Re: [VOTE] Release Apache Mesos 1.1.0 (rc1)

2016-10-25 Thread Alex Rukletsov
This vote is cancelled. We'll cut RC2 later this week after the blockers
are resolved.
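
One of the blockers raised in the quoted votes below is a file descriptor leak: sockets without the close-on-exec flag are inherited by executors. The sketch below shows the general POSIX mechanism only, not the actual fix for MESOS-6420; the helper names are illustrative, and it assumes a Unix-like system where the `fcntl` module is available.

```python
import fcntl
import os
import socket


def is_cloexec(fd):
    """Report whether the close-on-exec flag is set on a descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def set_cloexec(fd):
    """Set close-on-exec so the descriptor is not inherited across exec."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def demo():
    """Make a socket inheritable (the leaky case), then repair it."""
    sock = socket.socket()
    try:
        os.set_inheritable(sock.fileno(), True)  # simulate the leaky state
        before = is_cloexec(sock.fileno())
        set_cloexec(sock.fileno())
        after = is_cloexec(sock.fileno())
        return before, after
    finally:
        sock.close()
```

Calling `demo()` returns the flag state before and after the repair.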

On Tue, Oct 25, 2016 at 5:48 AM, Zameer Manji  wrote:

> I'm going to -1 (non binding) for the same reason as David Robinson.
>
> I would classify the FD leak as serious and a violation of the isolation
> that the agent provides.
>
> It should be back ported to 1.1.0 just like how it was backported to 1.0.2
>
> On Mon, Oct 24, 2016 at 5:37 PM, David Robinson 
> wrote:
>
>> -1
>>
>> Can the fix for MESOS-6420 be backported? The Mesos agent leaks sockets
>> when the port mapping network isolator is enabled, the leaked sockets are
>> passed to the executor (the close-on-exec flag is not set) and that can
>> cause problems for certain frameworks. The Aurora executor uses Kazoo (the
>> python ZooKeeper library) for service announcement, Kazoo uses Python's
>> select() call for polling its file descriptors and Python's select() chokes
>> when there's > 1024 file descriptors. The end result for Aurora is that
>> after an agent runs > 1024 tasks any new tasks will fail to announce (will
>> not be registered in ZooKeeper) and will therefore be unknown to other
>> services.
>>
>> On Tue, Oct 18, 2016 at 1:01 PM, Till Toenshoff  wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.1.0.
>>>
>>>
>>> 1.1.0 includes the following:
>>> 
>>> 
>>>   * [MESOS-2449] - **Experimental** support for launching a group of
>>> tasks
>>> via a new `LAUNCH_GROUP` Offer operation. Mesos will guarantee that
>>> either
>>> all tasks or none of the tasks in the group are delivered to the
>>> executor.
>>> Executors receive the task group via a new `LAUNCH_GROUP` event.
>>>
>>>   * [MESOS-2533] - **Experimental** support for HTTP and HTTPS health
>>> checks.
>>> Executors may now use the updated `HealthCheck` protobuf to implement
>>> HTTP(S) health checks. Both default executors (command and docker)
>>> leverage
>>> `curl` binary for sending HTTP(S) requests and connect to
>>> `127.0.0.1`,
>>> hence a task must listen on all interfaces. On Linux, for BRIDGE and
>>> USER
>>> modes, docker executor enters the task's network namespace.
>>>
>>>   * [MESOS-3421] - **Experimental** Support sharing of resources across
>>> containers. Currently persistent volumes are the only resources
>>> allowed to
>>> be shared.
>>>
>>>   * [MESOS-3567] - **Experimental** support for TCP health checks.
>>> Executors
>>> may now use the updated `HealthCheck` protobuf to implement TCP
>>> health
>>> checks. Both default executors (command and docker) connect to
>>> `127.0.0.1`,
>>> hence a task must listen on all interfaces. On Linux, for BRIDGE and
>>> USER
>>> modes, docker executor enters the task's network namespace.
>>>
>>>   * [MESOS-4324] - Allow access to persistent volumes as read-only or
>>> read-write
>>> by tasks. Mesos doesn't allow persistent volumes to be created as
>>> read-only
>>> but in 1.1 it starts allowing tasks to use the volumes as read-only.
>>> This is
>>> mainly motivated by shared persistent volumes but applies to regular
>>> persistent volumes as well.
>>>
>>>   * [MESOS-5275] - **Experimental** support for linux capabilities.
>>> Frameworks
>>> or operators now have fine-grained control over the capabilities
>>> that a
>>> container may have. This allows a container to run as root, but not
>>> have all
>>> the privileges associated with the root user (e.g., CAP_SYS_ADMIN).
>>>
>>>   * [MESOS-5344] -- **Experimental** support for partition-aware Mesos
>>> frameworks. In previous Mesos releases, when an agent is partitioned
>>> from
>>> the master and then reregisters with the cluster, all tasks running
>>> on the
>>> agent are terminated and the agent is shutdown. In Mesos 1.1,
>>> partitioned
>>> agents will no longer be shutdown when they reregister with the
>>> master. By
>>> default, tasks running on such agents will still be killed (for
>>> backward
>>> compatibility); however, frameworks can opt-in to the new
>>> PARTITION_AWARE
>>> capability. If they do this, their tasks will not be killed when a
>>> partition
>>> is healed. This allows frameworks to define their own policies for
>>> how to
>>> handle partitioned tasks. Enabling the PARTITION_AWARE capability
>>> also
>>> introduces a new set of task states: TASK_UNREACHABLE, TASK_DROPPED,
>>> TASK_GONE, TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. These new states
>>> are
>>> intended to eventually replace the TASK_LOST state.
>>>
>>>   * [MESOS-6077] - **Experimental** A new default executor is introduced
>>> which
>>> frameworks can use to launch task groups as nested containers. All
>>> the
>>> nested containers share resources like cpu, memory, network and
>>> volumes.

On Mesos versioning and deprecation policy

2016-10-12 Thread Alex Rukletsov
Folks,

There have been a bunch of online [1, 2] and offline discussions about our
deprecation and versioning policy. I found that people—including
myself—read the versioning doc [3] differently; moreover some aspects are
not captured there. I would like to start a discussion around this topic by
sharing my confusions and suggestions. This will hopefully help us stay on
the same page and have similar expectations. The second goal is to
eliminate ambiguities from the versioning doc (thanks Vinod for
volunteering to update it).

1. API vs. semantic changes.
The current versioning guide treats features (e.g. flags, metrics, endpoints)
and the API differently: incompatible changes to the former are allowed after
a 6-month deprecation cycle, while for the latter they require bumping the
major version. I suggest we consolidate these policies.

We should also define and clearly explain which changes require bumping the
major version. I have no strong opinion here and would love to hear what
people think. The original motivation for maintaining backwards
compatibility is to make sure vN schedulers can correctly work with the vN
API without being updated. But what about semantic changes that do not touch
the API? For example, what if we decide to send fewer task health updates to
schedulers based on some health policy? This influences the flow of task
status updates; should such a change be considered compatible? Taking it to
an extreme, we may not even be able to fix some bugs because someone may
already rely on the buggy behaviour!

Another closely related thing we should explicitly call out is
upgradability and rollback capabilities inside a major release. Committing
to this may significantly limit what we can change within a major release;
on the other hand, it will give users more time and a better experience
using and maintaining Mesos clusters.

2. Versioned vs. unversioned protobufs.
Currently we have v1 and unnamed protobufs, which simultaneously mean v0,
v2, and internal. I am sometimes confused about the right way to
update or introduce a field or message there. Do people feel the same? How
about splitting the unnamed version into explicit v0, v2, and internal?

Food for thought. It would be great if we can only maintain "diffs" to the
internal protobufs in the code, instead of duplicating them altogether.

3. API and feature labelling.
I suggest introducing explicit labels for APIs and features, to ensure
users have the right assumptions about their lifetime while engineers
have the ability to change a WIP feature in a non-compatible way. I
propose the following:
API: stable, non-stable, pure (not used by Mesos components)
Feature: experimental, normal.

Looking forward to your thoughts and suggestions.
AlexR

[1] https://www.mail-archive.com/user@mesos.apache.org/msg08025.html
[2] https://www.mail-archive.com/dev@mesos.apache.org/msg36621.html
[3]
https://github.com/apache/mesos/blob/b2beef37f6f85a8c75e968136caa7a1f292ba20e/docs/versioning.md


Re: How to shutdown mesos-agent gracefully?

2016-10-12 Thread Alex Rukletsov
Just to make sure: are you aware of SIGUSR1?
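
For readers who have not used it: the signal is sent to the agent process by PID. A minimal sketch, assuming the agent's PID has already been looked up (how to find it is site-specific and not shown) and that, as I understand it, the agent reacts to SIGUSR1 by shutting down and unregistering from the master. The helper name and the injectable `kill` parameter are illustrative; the latter only exists so the function can be exercised without a real agent.

```python
import os
import signal


def request_agent_shutdown(pid, kill=os.kill):
    """Send SIGUSR1 to the process with the given PID."""
    kill(pid, signal.SIGUSR1)
```

From a shell, the equivalent is `kill -USR1` followed by the agent's PID.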

On Tue, Oct 11, 2016 at 5:37 PM, tommy xiao  wrote:

> Hi Ma,
>
> could you please input more background, why Maintenance feature  is not
> best option for your request?
>
> 2016-10-11 14:47 GMT+08:00 haosdent :
>
> > gracefully means not affect running tasks?
> >
> > On Tue, Oct 11, 2016 at 2:36 PM, Klaus Ma 
> wrote:
> >
> >> It seems there's no way to shut down mesos-agent gracefully.
> >> The Maintenance feature expects the agents to re-register in the future.
> >>
> >> Thanks
> >> Klaus
> >> --
> >>
> >> Regards,
> >> 
> >> Da (Klaus), Ma (马达), PMP® | Software Architect
> >> IBM Platform Development & Support, STG, IBM GCG
> >> +86-10-8245 4084 | mad...@cn.ibm.com | http://k82.me
> >>
> >
> >
> >
> > --
> > Best Regards,
> > Haosdent Huang
> >
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: 1.1.0 release

2016-10-12 Thread Alex Rukletsov
Folks,

we have 23 unresolved tickets targeted for the Mesos 1.1.0 release, including 7
blockers and 3 epics (MESOS-5344, MESOS-3421, MESOS-2449), which turns 23
into 55. Obviously, we can’t make a cut today.

Shepherds, please either commit your blockers by Thu EOD PST or declare them
as non-blockers. For unfinished epics, please transition all unresolved
tickets to a new epic (see previous email) or retarget the epic. Make sure
the CHANGELOG is in good shape.

We strive to cut the release on Fri Oct 14 around 13:00 CEST. At that time
we will bulk-transition all unresolved tickets to 1.2.

Rigorously,
Alex & Till

On Tue, Oct 11, 2016 at 5:30 PM, Alex Rukletsov <a...@mesosphere.io> wrote:

> Folks,
>
> in preparation for Mesos 1.1.0 release we would like to ask people who
> have worked on features in 1.1.0 to either:
> * update the CHANGELOG and declare the feature implemented or
> experimental, make sure documentation is updated as well;
> * postpone to 1.2 and update the related epic;
> * promote an experimental feature to stable if necessary.
>
> If you think you need to land something in 1.1.0, please mark the
> respective JIRA as a blocker and set the target version to 1.1.0. Bear in
> mind the release will be cut *tomorrow*, Oct 12 2016.
>
> For experimental features, consider creating a separate epic and moving
> all unresolved tickets there, while marking the original epic as resolved
> for 1.1.0. For example, see MESOS-2449 (pods) and MESOS-6355
> (pods-improvements).
>
> Below is the list of candidates for the CHANGELOG update with their
> respective owners:
> MESOS-6014 CNI port-mapping Avinash, Jie
> MESOS-2449 Pods, subtopics: nested containers, nested isolators, default
> executor Vinod
> MESOS-5676 New Mesos CLI Kevin
> MESOS-4697 Unified Cgroups isolator Haosdent, Jie
> MESOS-6007 v1 API Anand, Vinod
> MESOS-3302 - // -
> MESOS-4855 - // -
> MESOS-4791 - // -
> MESOS-4766 Allocator performance BenM
> MESOS-4936 Container security Jie
> MESOS-4936 Capabilities and container security Benjamin Bannier, Jie
> MESOS-3421 Shared resources Yan Xu
> MESOS-5344 Partition awareness  Neil
>
> Below is the list of features marked as experimental in 1.0. Are they
> ready to be promoted and called out in the CHANGELOG?
> MESOS-4312 Power PC Vinod
> MESOS-4828 XFS disk isolator Yan Xu
> MESOS-4641 Network CNI isolator Qian, Jie
> MESOS-3094 Mesos tasks on Windows Joseph
> MESOS-4355 Docker volume isolator Guangya, Qian, Jie
>
> This one has never been even called experimental. Joseph, is it time to do
> so?
> MESOS-898 CMake (never declared even experimental) Joseph
>
> Thanks in advance for cooperation,
> Till and AlexR
>
> On Fri, Oct 7, 2016 at 7:47 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
>> I think you need to clean up the JIRA a bit.
>>
>> 1) Make sure unresolved tickets do not have fix version (1.1.0) set.
>> 2) Move "Fix version 1.1.0" to "Target version 1.1.0".
>>
>> 2) might obviate the need for 1).
>>
>>
>>
>> On Fri, Oct 7, 2016 at 7:24 AM, Till Toenshoff <toensh...@me.com> wrote:
>>
>>> Hi everyone!
>>>
>>> it's us who will be the Release Managers for 1.1.0 - Alex and Till!
>>>
>>> We are planning to cut the next release (1.1.0) within three workdays -
>>> that would be Wednesday next week. So, if you have any patches that need to
>>> get into 1.1.0 make sure that either they are already in the master branch or the
>>> corresponding ticket has a target version set to 1.1.0.
>>>
>>> The release dashboard:
>>> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12329720
>>>
>>> Alex & Till
>>>
>>
>>
>


Re: 1.1.0 release

2016-10-11 Thread Alex Rukletsov
Folks,

in preparation for Mesos 1.1.0 release we would like to ask people who have
worked on features in 1.1.0 to either:
* update the CHANGELOG and declare the feature implemented or experimental,
make sure documentation is updated as well;
* postpone to 1.2 and update the related epic;
* promote an experimental feature to stable if necessary.

If you think you need to land something in 1.1.0, please mark the
respective JIRA as a blocker and set the target version to 1.1.0. Bear in
mind the release will be cut *tomorrow*, Oct 12 2016.

For experimental features, consider creating a separate epic and moving all
unresolved tickets there, while marking the original epic as resolved for
1.1.0. For example, see MESOS-2449 (pods) and MESOS-6355
(pods-improvements).

Below is the list of candidates for the CHANGELOG update with their
respective owners:
MESOS-6014 CNI port-mapping Avinash, Jie
MESOS-2449 Pods, subtopics: nested containers, nested isolators, default
executor Vinod
MESOS-5676 New Mesos CLI Kevin
MESOS-4697 Unified Cgroups isolator Haosdent, Jie
MESOS-6007 v1 API Anand, Vinod
MESOS-3302 - // -
MESOS-4855 - // -
MESOS-4791 - // -
MESOS-4766 Allocator performance BenM
MESOS-4936 Container security Jie
MESOS-4936 Capabilities and container security Benjamin Bannier, Jie
MESOS-3421 Shared resources Yan Xu
MESOS-5344 Partition awareness  Neil

Below is the list of features marked as experimental in 1.0. Are they ready
to be promoted and called out in the CHANGELOG?
MESOS-4312 Power PC Vinod
MESOS-4828 XFS disk isolator Yan Xu
MESOS-4641 Network CNI isolator Qian, Jie
MESOS-3094 Mesos tasks on Windows Joseph
MESOS-4355 Docker volume isolator Guangya, Qian, Jie

This one has never been even called experimental. Joseph, is it time to do
so?
MESOS-898 CMake (never declared even experimental) Joseph

Thanks in advance for cooperation,
Till and AlexR

On Fri, Oct 7, 2016 at 7:47 PM, Vinod Kone  wrote:

> I think you need to clean up the JIRA a bit.
>
> 1) Make sure unresolved tickets do not have fix version (1.1.0) set.
> 2) Move "Fix version 1.1.0" to "Target version 1.1.0".
>
> 2) might obviate the need for 1).
>
>
>
> On Fri, Oct 7, 2016 at 7:24 AM, Till Toenshoff  wrote:
>
>> Hi everyone!
>>
>> it's us who will be the Release Managers for 1.1.0 - Alex and Till!
>>
>> We are planning to cut the next release (1.1.0) within three workdays -
>> that would be Wednesday next week. So, if you have any patches that need to
>> get into 1.1.0 make sure that either they are already in the master branch or the
>> corresponding ticket has a target version set to 1.1.0.
>>
>> The release dashboard:
>> https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12329720
>>
>> Alex & Till
>>
>
>


Re: what is the status on this?

2016-09-21 Thread Alex Rukletsov
Kant,

we would love to walk new community members through the code! We understand
how important it is to have a more experienced member of the community to
help out with patches, hence we have "shepherds". Moreover, though
technically possible, it is not advised to start working without having an
agreement with your shepherd.

Joseph Wu is driving the effort; get in touch with him and I'm sure you'll
figure out the plan!

On Tue, Sep 13, 2016 at 9:41 PM, kant kodali <kanth...@gmail.com> wrote:

> @Alex Rukletsov I am sorry I took some time to respond. I am very excited
> since the beginning to have an opportunity to work on this task but I
> wanted to take my time if I can really commit to the Task and looks I might
> be able to however I have not contributed to open source before and I would
> need some help from someone who can point me to the right parts of the code
> and basically help me navigate through the process and if that is feasible
> I will be happy to commit some time every week to work on this. please let
> me know if that works.
>
>
>
> On Tue, Sep 6, 2016 11:59 AM, Dario Rexin dre...@apple.com wrote:
>
>> Frameworks would use the redirect mechanism of the HTTP API and in case
>> of unreachable nodes could do round robin on the list of master nodes.
>>
>> On Sep 6, 2016, at 11:52 AM, Joseph Wu <jos...@mesosphere.io> wrote:
>>
>> And for discovery of other nodes in the Paxos group.
>>
>> The work on modularizing/decoupling Zookeeper is a prerequisite for
>> having the replicated log perform leader election itself.  <- That would
>> merely be another implementation of the interface we will introduce in the
>> process:
>>
>> https://issues.apache.org/jira/browse/MESOS-3574
>>
>> On Tue, Sep 6, 2016 at 11:31 AM, Avinash Sridharan <avin...@mesosphere.io
>> > wrote:
>>
>> Also, I think, the replicated log itself uses Zookeeper for leader
>> election.
>>
>> On Tue, Sep 6, 2016 at 12:15 PM, Zameer Manji <zma...@apache.org> wrote:
>>
>> If we use the replicated log for leader election, how will frameworks
>> detect the leading master? Right now the scheduler driver uses the
>> MasterInfo in ZK to discover the leader and detect leadership changes.
>>
>> On Mon, Sep 5, 2016 at 10:18 AM, Dario Rexin <dre...@apple.com> wrote:
>>
>> If we go and change this, why not simply remove any dependencies to
>> external systems and simply use the replicated log for leader election?
>>
>> On Sep 5, 2016, at 9:02 AM, Alex Rukletsov <a...@mesosphere.com> wrote:
>>
>> Kant—
>>
>> thanks a lot for the feedback! Are you interested in helping out with
>> Consul module once Jay and Joseph are done with modularizing patches?
>>
>> On Mon, Sep 5, 2016 at 8:50 AM, Jay JN Guo <guojian...@cn.ibm.com> wrote:
>>
>> Patches are currently under review by @Joseph and can be found at the
>> links provided by @haosdent.
>>
>> I took a quick look at Consul key/value HTTP APIs and they look very
>> similar to Etcd APIs. You could actually reuse our Etcd module
>> implementation once we manage to push the module into Mesos community.
>>
>> The only technical problem I could see for now is that Consul does not
>> support `POST` with incremental key index. We may need to leverage
>> `?cas=` operation in Consul to emulate the behaviour of joining a
>> key group.
>>
>> We could have a discussion on how to implement Consul HA module.
>>
>> cheers,
>> /J
>>
>>
>> - Original message -
>> From: haosdent <haosd...@gmail.com>
>> To: user <user@mesos.apache.org>
>> Cc: Jay JN Guo/China/IBM@IBMCN
>> Subject: Re: what is the status on this?
>> Date: Sun, Sep 4, 2016 6:10 PM
>>
>> Jay has some patches for de-couple Mesos with Zookeeper
>>
>> https://issues.apache.org/jira/browse/MESOS-5828
>> https://issues.apache.org/jira/browse/MESOS-5829
>>
>> I think it should be possible to support consul by custom modules after
>> jay's work done.
>>
>> On Sun, Sep 4, 2016 at 6:02 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>> Hi Alex,
>>
>> We have some experienced devops people here and they all had one thing in
>> common which is Zookeeper is a pain to maintain. In fact we refused to
>> bring in new tech stacks that require Zookeeper such as Kafka for example.
>> so we desperately in search for alternative preferably using consul. I just
>> hear lot of positive response when comes it consul. It will be great to see
>> mesos and consul working t

Re: mesos marathon roles

2016-09-08 Thread Alex Rukletsov
Vincent,

The role in a "consumed" resource can be "*", but the allocator will account
for this resource based on the consumer's role.

In other words, if your Marathon is registered in role "prod", all "*"
resources it consumes will be accounted to the "prod" role. Hence yes, you can
leave everything unreserved and wDRF will do exactly what you expect: split
resource offers between the "foo" and "bar" roles according to their weights.
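
To illustrate the accounting, here is a simplified sketch of weighted DRF. It
is a toy model under my own assumptions, not the actual allocator code;
`dominant_share`, `next_role`, and all numbers are made up for illustration.
Each role is charged for everything its frameworks consume (including "*"
resources), its dominant share is divided by its weight, and the role with
the lowest result is offered resources next.

```python
def dominant_share(allocated, total):
    """Largest fraction of any single cluster resource held by a role."""
    return max(
        (allocated.get(name, 0.0) / amount
         for name, amount in total.items() if amount > 0),
        default=0.0,
    )


def next_role(allocations, weights, total):
    """Pick the role with the lowest weighted dominant share.

    allocations: role -> {resource name -> amount consumed by that role}
    weights:     role -> weight (a missing role defaults to 1.0)
    total:       resource name -> total amount in the cluster
    Ties are broken alphabetically to keep the result deterministic.
    """
    return min(
        sorted(allocations),
        key=lambda role: dominant_share(allocations[role], total)
        / weights.get(role, 1.0),
    )


# Example: "foo" holds 40% of the CPUs, "bar" holds 30% of the memory.
total = {"cpus": 10.0, "mem": 100.0}
allocations = {
    "foo": {"cpus": 4.0, "mem": 10.0},
    "bar": {"cpus": 1.0, "mem": 30.0},
}
print(next_role(allocations, {"foo": 2.0, "bar": 1.0}, total))
print(next_role(allocations, {"foo": 1.0, "bar": 1.0}, total))
```

With weight 2 the "foo" role's share counts as 0.4 / 2 = 0.2, which is below
0.3 for "bar", so "foo" is served first; with equal weights "bar" goes first.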

On Thu, Sep 8, 2016 at 5:12 PM, Greg Mann  wrote:

> Vincent,
> You are correct in thinking that you can use weights to affect the
> allocation of resources between frameworks in different roles. While Mesos
> will adjust the offers it sends to frameworks based on their role's
> weighted share, it does not reserve resources in order to accomplish this.
> A resource reservation is created by an operator or framework, not by Mesos
> on its own. Reserved resources are only offered to the role for which they
> are reserved, and they remain reserved until the operator/framework
> unreserves them. Reservations are useful to ensure that specific resources
> will not be used by frameworks outside of a single role.
>
> So, by using weights and roles you can affect how offers will be sent to
> your frameworks, but the resources in those offers will not be reserved
> unless you or one of your frameworks explicitly reserves them itself.
>
> The documentation on roles also has some useful information:
> http://mesos.apache.org/documentation/latest/roles/
>
> Cheers,
> Greg
>
>
>
> On Wed, Sep 7, 2016 at 10:29 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Tx
>> If I understand I cannot use the weights to dynamically affect resources
>> among 2 different roles ? I thought I could let everything unreserved (role
>> *) and the DRF would use the weights to assign those unreserved resources
>> to roles "foo" and "bar" ?
>>
>> 2016-09-08 6:19 GMT+02:00 Greg Mann :
>>
>>> Hi Vincent,
>>>
>>>> Can you confirm it's because I didn't set any static reservation?
>>>
>>> Yes, that's correct.
>>>
>>>> So how could I check the resource allocation with multiple Marathon
>>>> instances and roles, and configured weights between these roles? Is
>>>> Marathon supposed to reserve resources with the role it's configured to?
>>>> If yes, how can I check?
>>>
>>> No, as far as I know, Marathon doesn't automatically reserve any
>>> resources for its role. The '--mesos_role' flag only sets the role that
>>> Marathon will use when registering with the Mesos master. This means that
>>> it will receive offers reserved for that role, but the resource
>>> reservations must be made separately. You could reserve the resources
>>> statically via your agent configuration, or you could use the operator HTTP
>>> API to accomplish this with dynamic reservations: see the section on
>>> operator HTTP endpoints in the reservation docs.
>>>
>>> Let me know if you have any other questions. It's also possible that I'm
>>> unaware of some Marathon feature that could be helpful here - you could
>>> also check the Marathon docs and reach out on their mailing list or IRC.
>>>
>>> Cheers,
>>> Greg
>>>
>>>
>>
>


Re: what is the status on this?

2016-09-05 Thread Alex Rukletsov
Kant—

thanks a lot for the feedback! Are you interested in helping out with
the Consul module once Jay and Joseph are done with the modularizing patches?

On Mon, Sep 5, 2016 at 8:50 AM, Jay JN Guo <guojian...@cn.ibm.com> wrote:

> Patches are currently under review by @Joseph and can be found at the
> links provided by @haosdent.
>
> I took a quick look at Consul key/value HTTP APIs and they look very
> similar to Etcd APIs. You could actually reuse our Etcd module
> implementation once we manage to push the module into Mesos community.
>
> The only technical problem I could see for now is that Consul does not
> support `POST` with incremental key index. We may need to leverage
> `?cas=` operation in Consul to emulate the behaviour of joining a
> key group.
>
> We could have a discussion on how to implement Consul HA module.
>
> cheers,
> /J
>
>
> - Original message -
> From: haosdent <haosd...@gmail.com>
> To: user <user@mesos.apache.org>
> Cc: Jay JN Guo/China/IBM@IBMCN
> Subject: Re: what is the status on this?
> Date: Sun, Sep 4, 2016 6:10 PM
>
> Jay has some patches for de-couple Mesos with Zookeeper
>
> https://issues.apache.org/jira/browse/MESOS-5828
> https://issues.apache.org/jira/browse/MESOS-5829
>
> I think it should be possible to support consul by custom modules after
> jay's work done.
>
> On Sun, Sep 4, 2016 at 6:02 PM, kant kodali <kanth...@gmail.com> wrote:
>
> Hi Alex,
>
> We have some experienced devops people here and they all had one thing in
> common which is Zookeeper is a pain to maintain. In fact we refused to
> bring in new tech stacks that require Zookeeper such as Kafka for example.
> so we desperately in search for alternative preferably using consul. I just
> hear lot of positive response when comes it consul. It will be great to see
> mesos and consul working together in which we would be ready to jump at it
> and make a switch for YARN to Mesos.
>
> Thanks,
> Kant
>
>
>
>
> On Wed, Aug 31, 2016 1:03 AM, Alex Rukletsov a...@mesosphere.com wrote:
>
> Kant—
>
> mind telling us what is your use case and why this ticket is important for
> you? It will help us prioritize work.
>
> On Fri, Aug 26, 2016 at 2:46 AM, tommy xiao <xia...@gmail.com> wrote:
>
> Hi guys, i always focus on this case. but good news is etcd always have
> patches. so the coming consul is very easy, just need some time to do coding
> on it. if you have interesting it? let us collaborate it.
>
> 2016-08-26 8:11 GMT+08:00 Joseph Wu <jos...@mesosphere.io>:
>
> There is no timeline as no one has done any work on the issue.
>
>
> On Thu, Aug 25, 2016 at 4:54 PM, kant kodali <kanth...@gmail.com> wrote:
>
> Hi Guys,
>
> I see this ticket and other related tickets should be part of sprints in
> 2015 and it is still not resolved yet. can we have a timeline on this? This
> would be really helpful
>
> https://issues.apache.org/jira/browse/MESOS-3797
>
> Thanks!
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
>
>
>


Re: [VOTE] Release Apache Mesos 1.0.1 (rc1)

2016-08-12 Thread Alex Rukletsov
+1 (binding)

make check on Mac OS 10.11.6 with apple clang-703.0.31.

DockerFetcherPluginTest.INTERNET_CURL_FetchImage is flaky (MESOS-4570), but
this does not seem to be a regression or a blocker.

On Fri, Aug 12, 2016 at 10:30 PM, Radoslaw Gruchalski 
wrote:

> I am trying to build Mesos 1.0.1 for Centos 7 in a Docker container but
> I'm hitting this: https://issues.apache.org/jira/browse/MESOS-5925.
>
> Kind regards,
>
> Radek Gruchalski
> ra...@gruchalski.com
> +4917685656526
>
> On Thu, Aug 11, 2016 at 2:32 AM, Vinod Kone  wrote:
>
>> Hi all,
>>
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.0.1.
>>
>>
>> The CHANGELOG for the release is available at:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.0.1-rc1
>>
>> 
>> 
>>
>>
>> The candidate for Mesos 1.0.1 release is available at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz
>>
>>
>> The tag to be voted on is 1.0.1-rc1:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.0.1-rc1
>>
>>
>> The MD5 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz.md5
>>
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.0.1-rc1/mesos-1.0.1.tar.gz.asc
>>
>>
>> The PGP key used to sign the release is here:
>>
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>>
>> The JAR is up in Maven in a staging repository here:
>>
>> https://repository.apache.org/content/repositories/orgapachemesos-1155
>>
>>
>> Please vote on releasing this package as Apache Mesos 1.0.1!
>>
>>
>> The vote is open until Mon Aug 15 17:29:33 PDT 2016 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Mesos 1.0.1
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> Thanks,
>>
>
>


Re: [VOTE] Release Apache Mesos 1.0.0 (rc2)

2016-07-15 Thread Alex Rukletsov
Haosdent investigated the issue, and it seems that health checks do work
for docker executor. Hence I retract my negative vote.

On Fri, Jul 15, 2016 at 12:57 PM, Alex Rukletsov <a...@mesosphere.com>
wrote:

> -1 (binding): MESOS-5848
> <https://issues.apache.org/jira/browse/MESOS-5848>. The fix is on the way.
>
> On Wed, Jul 13, 2016 at 1:19 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
>> +1 (nonbinding)
>>
>> Tested by 1)running all tests on Mac OS, 2) perform upgrade and downgrade
>> on a small test cluster for both master and slave.
>>
>>
>>
>> On Mon, Jul 11, 2016 at 10:13 AM, Kapil Arya <ka...@mesosphere.io> wrote:
>>
>>> None of the stable builds have SSL yet. The first SSL-enabled stable build
>>> will be 1.0.0. Sorry for the confusion.
>>>
>>> Kapil
>>>
>>> On Mon, Jul 11, 2016 at 1:03 PM, Zhitao Li <zhitaoli...@gmail.com>
>>> wrote:
>>>
>>> > Hi Kapil,
>>> >
>>> > Do you mean that the stable builds from
>>> > http://open.mesosphere.com/downloads/mesos is using the new
>>> configuration?
>>> >
>>> > On Sun, Jul 10, 2016 at 10:07 AM, Kapil Arya <ka...@mesosphere.io>
>>> wrote:
>>> >
>>> >> The binary rpm/deb packages can be found here:
>>> >>
>>> http://open.mesosphere.com/downloads/mesos-rc/#apache-mesos-1.0.0-rc2
>>> >> .
>>> >>
>>> >> Please note that starting with the 1.0.0 release (including RCs and
>>> >> recent nightly builds), Mesos is configured with SSL and 3rdparty
>>> >> module dependency installation. Here is the configure command line:
>>> >> ./configure --enable-libevent --enable-ssl
>>> >> --enable-install-module-dependencies
>>> >>
>>> >> As always, the stable builds are available at:
>>> >> http://open.mesosphere.com/downloads/mesos
>>> >>
>>> >> The instructions for nightly builds are available at:
>>> >> http://open.mesosphere.com/downloads/mesos-nightly/
>>> >>
>>> >> Best,
>>> >> Kapil
>>> >>
>>> >>
>>> >> On Thu, Jul 7, 2016 at 9:35 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>> >> >
>>> >> > Hi all,
>>> >> >
>>> >> > Please vote on releasing the following candidate as Apache Mesos 1.0.0.
>>> >> >
>>> >> > 1.0.0 includes the following:
>>> >> >
>>> >> >   * Scheduler and Executor v1 HTTP APIs are now considered stable.
>>> >> >
>>> >> >   * [MESOS-4791] - **Experimental** support for v1 Master and Agent APIs.
>>> >> >     These APIs let operators and services (monitoring, load balancers) send
>>> >> >     HTTP requests to '/api/v1' endpoint on master or agent. See
>>> >> >     `docs/operator-http-api.md` for details.
>>> >> >
>>> >> >   * [MESOS-4828] - **Experimental** support for a new `disk/xfs` isolator
>>> >> >     has been added to isolate disk resources more efficiently. Please refer
>>> >> >     to docs/mesos-containerizer.md for more details.
>>> >> >
>>> >> >   * [MESOS-4355] - **Experimental** support for Docker volume plugin. We
>>> >> >     added a new isolator 'docker/volume' which allows users to use external
>>> >> >     volumes

Re: [VOTE] Release Apache Mesos 1.0.0 (rc2)

2016-07-15 Thread Alex Rukletsov
-1 (binding): MESOS-5848 <https://issues.apache.org/jira/browse/MESOS-5848>.
The fix is on the way.

On Wed, Jul 13, 2016 at 1:19 AM, Zhitao Li  wrote:

> +1 (nonbinding)
>
> Tested by 1)running all tests on Mac OS, 2) perform upgrade and downgrade
> on a small test cluster for both master and slave.
>
>
>
> On Mon, Jul 11, 2016 at 10:13 AM, Kapil Arya  wrote:
>
>> None of the stable builds have SSL yet. The first SSL-enabled stable build
>> will be 1.0.0. Sorry for the confusion.
>>
>> Kapil
>>
>> On Mon, Jul 11, 2016 at 1:03 PM, Zhitao Li  wrote:
>>
>> > Hi Kapil,
>> >
>> > Do you mean that the stable builds from
>> > http://open.mesosphere.com/downloads/mesos is using the new
>> configuration?
>> >
>> > On Sun, Jul 10, 2016 at 10:07 AM, Kapil Arya 
>> wrote:
>> >
>> >> The binary rpm/deb packages can be found here:
>> >>
>> http://open.mesosphere.com/downloads/mesos-rc/#apache-mesos-1.0.0-rc2
>> >> .
>> >>
>> >> Please note that starting with the 1.0.0 release (including RCs and
>> >> recent nightly builds), Mesos is configured with SSL and 3rdparty
>> >> module dependency installation. Here is the configure command line:
>> >> ./configure --enable-libevent --enable-ssl
>> >> --enable-install-module-dependencies
>> >>
>> >> As always, the stable builds are available at:
>> >> http://open.mesosphere.com/downloads/mesos
>> >>
>> >> The instructions for nightly builds are available at:
>> >> http://open.mesosphere.com/downloads/mesos-nightly/
>> >>
>> >> Best,
>> >> Kapil
>> >>
>> >>
>> >> On Thu, Jul 7, 2016 at 9:35 PM, Vinod Kone wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > Please vote on releasing the following candidate as Apache Mesos 1.0.0.
>> >> >
>> >> > 1.0.0 includes the following:
>> >> >
>> >> >   * Scheduler and Executor v1 HTTP APIs are now considered stable.
>> >> >
>> >> >   * [MESOS-4791] - **Experimental** support for v1 Master and Agent APIs.
>> >> >     These APIs let operators and services (monitoring, load balancers) send
>> >> >     HTTP requests to '/api/v1' endpoint on master or agent. See
>> >> >     `docs/operator-http-api.md` for details.
>> >> >
>> >> >   * [MESOS-4828] - **Experimental** support for a new `disk/xfs` isolator
>> >> >     has been added to isolate disk resources more efficiently. Please refer
>> >> >     to docs/mesos-containerizer.md for more details.
>> >> >
>> >> >   * [MESOS-4355] - **Experimental** support for Docker volume plugin. We
>> >> >     added a new isolator 'docker/volume' which allows users to use external
>> >> >     volumes in Mesos containerizer. Currently, the isolator interacts with the
>> >> >     Docker volume plugins using a tool called 'dvdcli'. By speaking the Docker
>> >> >     volume plugin API, most of the Docker volume plugins are supported.
>> >> >
>> >> >   * [MESOS-4641] - **Experimental** A new network isolator, the
>> >> >     `network/cni` isolator, has been introduced in the `MesosContainerizer`. The
>> >> >     `network/cni` isolator implements the Container Network Interface (CNI)
>> >> >     specification proposed by CoreOS.  With CNI the `network/cni` isolator is
>> >> >     able to allocate a network namespace to Mesos containers and attach the
>> >> >     container to different types of IP networks by invoking network drivers
>> >> >     called CNI plugins.
>> >> >
>> >> >   * [MESOS-2948, MESOS-5403] - The authorizer interface has been refactored in
>> >> >     order to decouple the ACLs definition language from the interface.
>> >> >     It additionally includes the option of retrieving `ObjectApprover`. An
>> >> >     `ObjectApprover` can be used to synchronously check authorizations for a
>> >> >     given object and is hence useful when authorizing a large number of objects
>> >> >     and/or large objects (which need to be copied using request based
>> >> >     authorization). NOTE: This is a **breaking change** for authorizer modules.
>> >> >
>> >> >   * [MESOS-5405] - The `subject` and `object` fields in authorization::Request
>> >> >     have been changed from required to optional. If either of these fields is

Re: removed slace "ID": (131.154.96.172): health check timed out

2016-04-18 Thread Alex Rukletsov
I believe it's because slaves are able to connect to the master, but the
master is not able to connect to the slaves. That's why you see them
connected for some time and gone afterwards.

On Mon, Apr 18, 2016 at 6:47 PM, Stefano Bianchi 
wrote:

> Indeed, i dont know why, i am not able to reach all the machines from a
> network to the other, just some machines can interconnect with some others
> among the networks.
> On mesos i see that all the slaves at a certain time are all connected,
> then disconnected and after a while connected again, it seems like they are
> able to connect for a while.
> However is an openstack issue i guess.
>
> Does this also happen when master3 is leading? My guess is that you're not
> allowing incoming connections from master1 and master2 to slave3.
> Generally, masters should be able to connect to slaves, not just respond to
> their requests.
> On 18 Apr 2016 13:17, "Stefano Bianchi"  wrote:
>
>> Hi
>> On openstack i plugged two virtual networks to the same virtual router so
>> that the hosts on the 2 networks can communicate each other.
>> this is my topology:
>>
>>         ---internet---
>>               |
>>            Router1
>>               |
>>      ---------------------
>>      |                   |
>>     Net1                Net2
>> Master1 Master2       Master3
>> Slave1  slave2        Slave3
>>
>> I have set zookeeper in with this line:
>>
>> zk://Master1_IP:2181,Master2_IP:2181,Master3_IP:2181/mesos
>>
>> The 3 masters, even though on 2 separated networks, elect the leader
>> correctly.
>> Now i have started the slaves, and in a first time i see all 3 correctly
>> registered, but after a while the slave 3, independently from who is the
>> master, disconnects.
>> I saw in the log and i get the message in the object.
>> Can you help me to solve this problem?
>>
>>
>> Thanks to all.
>>
>


Re: removed slace "ID": (131.154.96.172): health check timed out

2016-04-18 Thread Alex Rukletsov
Does this also happen when master3 is leading? My guess is that you're not
allowing incoming connections from master1 and master2 to slave3.
Generally, masters should be able to connect to slaves, not just respond to
their requests.
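
A quick way to test this direction is to try opening a TCP connection from
each master host to the slave's port. Below is a minimal sketch;
`can_connect` is a hypothetical helper, and 5051 is assumed to be the slave
port (the default unless `--port` was changed).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Running `can_connect("SLAVE3_IP", 5051)` on master1 and master2 (with the
real address substituted) should show whether the security groups or router
rules let the masters reach slave3.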
On 18 Apr 2016 13:17, "Stefano Bianchi"  wrote:

> Hi
> On openstack i plugged two virtual networks to the same virtual router so
> that the hosts on the 2 networks can communicate each other.
> this is my topology:
>
>         ---internet---
>               |
>            Router1
>               |
>      ---------------------
>      |                   |
>     Net1                Net2
> Master1 Master2       Master3
> Slave1  slave2        Slave3
>
> I have set zookeeper in with this line:
>
> zk://Master1_IP:2181,Master2_IP:2181,Master3_IP:2181/mesos
>
> The 3 masters, even though on 2 separated networks, elect the leader
> correctly.
> Now i have started the slaves, and in a first time i see all 3 correctly
> registered, but after a while the slave 3, independently from who is the
> master, disconnects.
> I saw in the log and i get the message in the object.
> Can you help me to solve this problem?
>
>
> Thanks to all.
>


Re: Mesos agents across a WAN?

2016-03-31 Thread Alex Rukletsov
Jeff,

regarding 3: we are investigating this:
https://issues.apache.org/jira/browse/MESOS-3548

On Thu, Mar 31, 2016 at 3:56 AM, Jeff Schroeder 
wrote:

> Given regional bare metal Mesos clusters on multiple continents, are there
> any known issues running some of the agents over the WAN? Is anyone else
> doing it, or is this a terrible idea that I should tell management no on?
>
> A few specifics:
>
> 1. Are there any known limitations or configuration gotchas I might
> encounter?
> 2. Does setting up ZK observers in each non-primary dc and pointing the
> agents at them exclusively make sense?
> 3. Are there plans on a mesos equivalent of something like ubernetes[1],
> or would that be up to each framework?
> 4. Any suggestions on how best to do agent attributes / constraints for
> something like this? I was planning on having the config management add a
> "data_center" agent attribute to match on.
>
> Thanks!
>
> [1]
> https://github.com/kubernetes/kubernetes/blob/8813c955182e3c9daae68a8257365e02cd871c65/release-0.19.0/docs/proposals/federation.md#kubernetes-cluster-federation
>
> --
> Jeff Schroeder
>
> Don't drink and derive, alcohol and analysis don't mix.
> http://www.digitalprognosis.com
>


Re: Executors no longer inherit environment variables from the agent

2016-03-10 Thread Alex Rukletsov
I have two questions.

First, does this change include the executor library? We currently use
environment variables to propagate various config values from an agent to
executors. If it does, what is the alternative?

Second, what will be the preferred way to pass config values to executors?
It would be great to be able to do it uniformly for non-HTTP and HTTP
executors. I can think of several possibilities: cmd flags, adding or
overriding protobufs, extending the Executor interface.

On Tue, Mar 8, 2016 at 9:21 PM, Gilbert Song  wrote:

> Yes, `LIBPROCESS_IP` will be excepted from this change. We will still have
> `LIBPROCESS_IP` set and passed to executors' environment, which is for the
> case that DNS is not available on the slave.
>
> Gilbert
>
> On Tue, Mar 8, 2016 at 11:57 AM, Zhitao Li  wrote:
>
>> Is LIBPROCESS_IP going to be an exception to this? Some executors are
>> using this variable as an alternative of implementing their own IP
>> detection logic AFAIK so this behavior would break them.
>>
>> On Tue, Mar 8, 2016 at 11:33 AM, Gilbert Song 
>> wrote:
>>
>>> Hi,
>>>
>>> TL;DR Executors will no longer inherit environment variables from the
>>> agent by default in 0.30.
>>>
>>> Currently, executors inherit environment variables from the agent
>>> in the mesos containerizer by default. This is an unfortunate legacy behavior
>>> and is insecure. If you do have environment variables that you want to pass
>>> to the executors, you can set them explicitly by using the
>>> `--executor_environment_variables` agent flag.
>>>
>>> Starting from 0.30, we will no longer allow executors to inherit
>>> environment variables from the agent. In other words,
>>> `--executor_environment_variables` will be set to “{}” by default. If you
>>> do depend on the original behavior, please set
>>> `--executor_environment_variables` flag explicitly.
>>>
>>> Let us know if you have any comments or concerns.
>>>
>>> Thanks,
>>> Gilbert
>>>
>>
>>
>>
>> --
>> Cheers,
>>
>> Zhitao Li
>>
>
>
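
For readers following the `--executor_environment_variables` discussion quoted above: the flag takes a JSON object mapping variable names to values. Below is a minimal sketch of building the flag value so that it survives shell quoting. The helper name `executor_env_flag` and the example variable are assumptions made for illustration, not something taken from the thread.

```python
import json
import shlex


def executor_env_flag(env):
    """Build the agent flag that passes exactly `env` to executors.

    `env` is a plain dict of variable names to string values. The JSON
    is shell-quoted so it can be pasted into a command line as is.
    """
    payload = json.dumps(env, sort_keys=True)
    return "--executor_environment_variables=" + shlex.quote(payload)


# Hypothetical example value; pick the variables your executors need.
print(executor_env_flag({"LD_LIBRARY_PATH": "/usr/local/lib"}))
# An empty dict reproduces the "{}" default announced in the thread.
print(executor_env_flag({}))
```

Keeping the variables in a dict and serialising them in one place avoids hand-editing JSON inside a shell command line.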


Re: Sync Mesos-Master to Slaves

2015-12-28 Thread Alex Rukletsov
Hi Fred,

hm, if the bug depends on the Ubuntu version, my random guess is that it's
systemd related. Were you able to solve the problem? If not, it would be
helpful if you could provide more context and describe a minimal setup that
reproduces the issue.

On Thu, Dec 10, 2015 at 10:15 AM, Frederic LE BRIS <fleb...@pagesjaunes.fr>
wrote:

> Thanks Alex.
>
> About the context, we use Spark on Mesos and Marathon to launch some
> Elasticsearch instances.
>
> I kill each leader one by one.
>
> By the way, as I said, to reproduce this behavior we run the Mesos master
> on Ubuntu 12 and the mesos-slave on Ubuntu 14.
>
> When I deploy master+slave only on Ubuntu 14, the issue disappears …
>
> Fred
>
>
>
>
>
>
> On 09 Dec 2015, at 16:30, Alex Rukletsov <a...@mesosphere.com> wrote:
>
> Frederic,
>
> I have skimmed through the logs and they do not seem to be complete
> (especially for master1). Could you please say what task has been killed
> (id) and which master failover triggered that? I see at least three
> failovers in the logs : ). Also, could you please share some background
> about your setup? I believe you're on systemd, do you use docker tasks?
>
> To connect our conversation to particular events, let me post here the
> chain of (potentially) interesting events and some info I mined from the
> logs.
> master1: 192.168.37.59 ?
> master2: 192.168.37.58
> master3: 192.168.37.104
>
> timestamp  observed by  event
> 13:48:38   master1      master1 killed by sigterm
> 13:48:48   master2,3    new leader elected (192.168.37.104), id=5
> 13:49:25   master2      master2 killed by sigterm
> 13:50:44   master2,3    new leader elected (192.168.37.59), id=7
> 14:23:34   master1      master1 killed by sigterm
> 14:23:44   master2,3    new leader elected (192.168.37.58), id=8
>
> One interesting thing I cannot understand is why master3 did not commit
> suicide when it lost leadership?
>
>
> On Mon, Dec 7, 2015 at 4:08 PM, Frederic LE BRIS <fleb...@pagesjaunes.fr>
> wrote:
>
>> With the context .. sorry
>>
>>
>
>


Re: mesos-elasticsearch vs Elasticsearch with Marathon

2015-12-28 Thread Alex Rukletsov
Craig,

mind elaborating, how exactly do you run elasticsearch in Marathon?

On Mon, Dec 28, 2015 at 8:36 PM, craig w <codecr...@gmail.com> wrote:
> In terms of discovery, elasticsearch provides that out of the box
> https://www.elastic.co/guide/en/elasticsearch/reference/1.4/modules-discovery.html.
> We deploy elasticsearch via Marathon and it works great.
>
> On Mon, Dec 28, 2015 at 2:17 PM, Eric LEMOINE <elemo...@mirantis.com> wrote:
>>
>> On Mon, Dec 28, 2015 at 7:55 PM, Alex Rukletsov <a...@mesosphere.com>
>> wrote:
>> > Eric—
>> >
>> > give me a chance to answer that before you fall into frustration : ).
>> > Also, you can directly write to framework developers
>> > (mesos...@container-solutions.com) and they either confirm or bust my
>> > guess. Or maybe one of the authors — Frank — will chime in in this
>> > thread.
>> >
>> > Marathon has no idea about application logic, hence a "scale"
>> > operation just starts more application instances. But sometimes you
>> > may want to do extra work (track instances, report ip:port of a new
>> > instance to existing instances, and so on). That's when a dedicated
>> > framework makes sense. Each framework has a scheduler that is able to
>> > track each instance and do all the aforementioned actions.
>> >
>> > How does this map to your question? AFAIK, all Elasticsearch nodes should
>> > see each other, hence once a new node is started, it should somehow be
>> > advertised to the other nodes. You can do it by wrapping the Elasticsearch
>> > command in a shell script and maintaining some sort of out-of-band
>> > registry; take a look at one of the first efforts [1] to run
>> > Elasticsearch on Mesos to get an impression of what it may look like. But
>> > you can use a dedicated framework instead : ).
>> >
>> > [1] https://github.com/mesosphere/elasticsearch-mesos
>>
>>
>> That makes great sense Alex. Thanks for chiming in.
>
>
>
>
> --
>
> https://github.com/mindscratch
> https://www.google.com/+CraigWickesser
> https://twitter.com/mind_scratch
> https://twitter.com/craig_links


Re: mesos-elasticsearch vs Elasticsearch with Marathon

2015-12-28 Thread Alex Rukletsov
Eric—

give me a chance to answer that before you fall into frustration : ).
Also, you can directly write to framework developers
(mesos...@container-solutions.com) and they either confirm or bust my
guess. Or maybe one of the authors — Frank — will chime in in this
thread.

Marathon has no idea about application logic, hence a "scale"
operation just starts more application instances. But sometimes you
may want to do extra work (track instances, report ip:port of a new
instance to existing instances, and so on). That's when a dedicated
framework makes sense. Each framework has a scheduler that is able to
track each instance and do all the aforementioned actions.

How does this map to your question? AFAIK, all Elasticsearch nodes should
see each other, hence once a new node is started, it should somehow be
advertised to the other nodes. You can do it by wrapping the Elasticsearch
command in a shell script and maintaining some sort of out-of-band
registry; take a look at one of the first efforts [1] to run
Elasticsearch on Mesos to get an impression of what it may look like. But
you can use a dedicated framework instead : ).

[1] https://github.com/mesosphere/elasticsearch-mesos
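
To make the "out-of-band registry" idea a bit more concrete, here is a toy sketch of the bookkeeping a wrapper script or a dedicated scheduler would have to do. It is not taken from [1] or from any framework; the class name, method names and endpoints are all hypothetical, and a real registry would live in a shared store rather than in process memory.

```python
class InstanceRegistry:
    """Toy in-memory registry of running instances (task id -> "ip:port")."""

    def __init__(self):
        self._instances = {}

    def register(self, task_id, endpoint):
        """Record a new instance and return the peers it should be told about."""
        peers = sorted(e for t, e in self._instances.items() if t != task_id)
        self._instances[task_id] = endpoint
        return peers

    def unregister(self, task_id):
        """Forget an instance, for example after its task has terminated."""
        self._instances.pop(task_id, None)


registry = InstanceRegistry()
registry.register("es-1", "10.0.0.1:9300")
# The second node learns about the first one when it starts.
print(registry.register("es-2", "10.0.0.2:9300"))
```

A plain "scale" operation performs none of these steps, which is the gap a dedicated scheduler fills.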

On Wed, Dec 23, 2015 at 10:30 AM, Eric LEMOINE  wrote:
> On Tue, Dec 22, 2015 at 10:05 AM, craig w  wrote:
>> We'd like to use the framework once some more features are available (see
>> the road map).
>>
>> Currently we deploy ES in docker using marathon.
>
>
>
> Thank you all for your responses. I get that the situation is not as
> clear as I expected :)


Re: Sync Mesos-Master to Slaves

2015-12-09 Thread Alex Rukletsov
Frederic,

I have skimmed through the logs and they do not seem to be complete
(especially for master1). Could you please say what task has been killed
(id) and which master failover triggered that? I see at least three
failovers in the logs : ). Also, could you please share some background
about your setup? I believe you're on systemd, do you use docker tasks?

To connect our conversation to particular events, let me post here the
chain of (potentially) interesting events and some info I mined from the
logs.
master1: 192.168.37.59 ?
master2: 192.168.37.58
master3: 192.168.37.104

timestamp  observed by  event
13:48:38   master1      master1 killed by sigterm
13:48:48   master2,3    new leader elected (192.168.37.104), id=5
13:49:25   master2      master2 killed by sigterm
13:50:44   master2,3    new leader elected (192.168.37.59), id=7
14:23:34   master1      master1 killed by sigterm
14:23:44   master2,3    new leader elected (192.168.37.58), id=8
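
To put numbers on the table above, here is a minimal sketch that computes how long each election took after the corresponding kill. The timestamps are copied from the table; the helper name `gap_seconds` is only for illustration, and the sketch assumes both events of a pair fall on the same day.

```python
from datetime import datetime

# (master killed, new leader elected) pairs copied from the table above.
FAILOVERS = [
    ("13:48:38", "13:48:48"),
    ("13:49:25", "13:50:44"),
    ("14:23:34", "14:23:44"),
]


def gap_seconds(killed, elected):
    """Seconds between a kill and the next election (same day assumed)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(elected, fmt) - datetime.strptime(killed, fmt)
    return int(delta.total_seconds())


for killed, elected in FAILOVERS:
    print(killed, "->", elected, gap_seconds(killed, elected), "s")
```

The first and third elections follow the kill by 10 seconds, while the second one takes 79 seconds.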

One interesting thing I cannot understand is why master3 did not commit
suicide when it lost leadership?


On Mon, Dec 7, 2015 at 4:08 PM, Frederic LE BRIS 
wrote:

> With the context .. sorry
>
>


Re: Verifying Zero Downtime Upgrade Process For Existing Mesos Cluster

2015-12-07 Thread Alex Rukletsov
Hi Abishek,

I would strongly advise not to skip 6 versions. It's hard to say whether
there were any changes that will prevent 0.25 masters from talking to 0.19
slaves (my intuition says there were some breaking changes to protobufs).
We do *not* support upgrading by skipping versions, so please upgrade to 0.20,
wait for stabilization, and repeat the procedure 5 more times.
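
Spelled out, the one-version-at-a-time path looks like the sketch below. The function name `upgrade_path` is hypothetical, and the sketch assumes plain "major.minor" release numbers within a single major version.

```python
def upgrade_path(current, target):
    """List every release to install, in order, one minor version at a time."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles a single major version")
    if tgt_minor < cur_minor:
        raise ValueError("target release is older than the current one")
    return ["%d.%d" % (cur_major, minor)
            for minor in range(cur_minor + 1, tgt_minor + 1)]


# Each entry means: upgrade to it, then wait for the cluster to stabilize.
print(upgrade_path("0.19", "0.25"))
# -> ['0.20', '0.21', '0.22', '0.23', '0.24', '0.25']
```

That is six upgrades in total: the first one to 0.20 plus five more.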

In the future we may move to another deprecation cycle, but currently we
have a 2-version one.

Mind reporting your experience to the list once you're done? Thanks!

On Thu, Dec 3, 2015 at 10:28 PM, Abishek Ravi  wrote:

> Would the following process enable zero downtime upgrade of Mesos (0.19 to
> 0.25) in an existing Mesos cluster?
>
> 0. From [1] it doesn't seem like there are any incompatible changes
> introduced between 0.19 and 0.25.
> 1. Deploy Mesos(0.25) binaries to unelected master nodes
> 2. Deploy Mesos(0.25) binaries to leading master. This should potentially
> result in master re-election and elect a master which already has
> Mesos(0.25) installed from Step (1).
> 3. Deploy Mesos(0.25) binaries to mesos slave nodes. Existing tasks should
> continue to execute and report to the master after mesos process launch
> (with 0.25 binaries) on the slave node.
>
> Known gotchas:
> 1. Any monitoring built around state.json and stats.json should be updated
> accordingly as endpoints have changed [1].
> 2. Checkpointing should be enabled (It is not automatically enabled in
> 0.19) [2] .
> 3. recovery_timeout for slave nodes should be set to an appropriate value
> depending on how long it takes to install Mesos(0.25) on the slave nodes.
>
> Is any step missing in the upgrade process? Are there other gotchas that
> one needs to be aware of?
>
> [1] http://mesos.apache.org/documentation/latest/upgrades/
> [2] http://mesos.apache.org/documentation/latest/slave-recovery/
>
> Thanks,
> Abishek
>


Re: [VOTE] Release Apache Mesos 0.26.0 (rc3)

2015-12-02 Thread Alex Rukletsov
`make check -j7`: OK
`make distcheck -j7`: fails, probably MESOS-3973, see hints below.

Both on Mac OS 10.10.4

I see the following lines in the log:
...
libtool: warning: 'libmesos.la' has not been installed in
'/Users/alex/Projects/mesos/build/default/mesos-0.26.0/_inst/lib'
libtool: warning: 'libmesos.la' has not been installed in
'/Users/alex/Projects/mesos/build/default/mesos-0.26.0/_inst/lib'
...
libtool: warning: 'libmesos.la' has not been installed in
'/Users/alex/Projects/mesos/build/default/mesos-0.26.0/_inst/lib'
libtool: warning: 'libmesos.la' has not been installed in
'/Users/alex/Projects/mesos/build/default/mesos-0.26.0/_inst/lib'
...
Cannot uninstall requirement mesos, not installed
Cannot uninstall requirement mesos.cli, not installed
Cannot uninstall requirement mesos.interface, not installed
Cannot uninstall requirement mesos.native, not installed
ERROR: files left after uninstall:
...

On Tue, Dec 1, 2015 at 8:49 PM, Till Toenshoff  wrote:

> Hi friends,
>
> Please vote on releasing the following candidate as Apache Mesos 0.26.0.
>
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.26.0-rc3
>
> 
>
> The candidate for Mesos 0.26.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc3/mesos-0.26.0.tar.gz
>
> The tag to be voted on is 0.26.0-rc3:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.26.0-rc3
>
> The MD5 checksum of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc3/mesos-0.26.0.tar.gz.md5
>
> The signature of the tarball can be found at:
>
> https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc3/mesos-0.26.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1091
>
> Please vote on releasing this package as Apache Mesos 0.26.0!
>
> The vote is open until Fri Dec  4 19:00:35 CET 2015 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 0.26.0
> [ ] -1 Do not release this package because …
>
> Thanks,
> Bernd & Till
>
>


Re: Change roles and weights without restarting Mesos

2015-11-27 Thread Alex Rukletsov
Hey Mario,

it's not possible right now, but there are several efforts which intend to
fix it in the near future. Take a look at [1] and [2].

[1] https://issues.apache.org/jira/browse/MESOS-3988
[2] https://issues.apache.org/jira/browse/MESOS-3177

On Fri, Nov 27, 2015 at 2:24 PM, Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> Hello,
>
> I was wondering if Mesos supports the possibility to change roles and
> weights at runtime. In YARN, it is possible to reload the configurations
> for roles every 10 seconds and that can be quite helpful.
>
> Thanks,
> Mario
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastore...@teralytics.ch
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Mark Schmitz, Dr. Angelica Kohlmann
> Küpper
> Data Privacy Supervisor: Prof. Dr. Donald Alan Kossmann
> 
> Crunching data, one terabyte at a time.
>


Re: Is it possible to monitor resource usage per-task for the same executor?

2015-11-02 Thread Alex Rukletsov
In Mesos, resources are isolated and accounted for per container. A task is
basically a description; it is up to the executor how to interpret it. In
some cases, for example if an executor *just* creates a message in its
internal queue for incoming tasks, it is almost impossible to track
resource usage per task.
On 2 Nov 2015 2:00 pm, "sujz" <43183...@qq.com> wrote:

> Hi, all:
> If we submit a job to framework like Spark, slave node runs our job
> concurrently with launching multiple tasks within the same container, I am
> not sure these tasks are run in per-process or per-thread? If they are in
> thread, can we  monitor resource usage for each task in mesos?
>
> Thank you!


Re: How to trace offers given to services/frameworks

2015-09-29 Thread Alex Rukletsov
The master logs the number of offers it sends to a framework. If you need
exact information about offer resources and you use the built-in allocator,
run the master with `GLOG_v=2` set in its environment, which will trigger
detailed allocation logging in the built-in allocator.
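
For the first part (the number of offers per framework), a minimal sketch of tallying those master log lines follows. The "Sending N offers to framework <id>" wording is assumed from a master log line quoted elsewhere on this list and may differ between versions; the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed wording of the master log line; adjust it if your version differs.
OFFERS = re.compile(r"Sending (\d+) offers? to framework (\S+)")


def offers_per_framework(lines):
    """Sum the number of offers the master reports sending to each framework."""
    totals = Counter()
    for line in lines:
        match = OFFERS.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


# Hypothetical log excerpt; framework ids are placeholders.
sample = [
    "master.cpp] Sending 5 offers to framework fw-A (marathon) at scheduler-1@host",
    "master.cpp] Sending 2 offers to framework fw-B (spark) at scheduler-2@host",
    "master.cpp] Sending 1 offers to framework fw-A (marathon) at scheduler-1@host",
    "master.cpp] Some unrelated line",
]
print(dict(offers_per_framework(sample)))
```

This only counts offers; the resources inside each offer are not part of these lines, which is why the allocator logging is needed for the details.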

On Tue, Sep 29, 2015 at 10:35 AM, tommy xiao  wrote:

> today i came across this question, i can't answer it. anyone can give a
> favor?
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Re: Fwd: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo change.

2015-09-25 Thread Alex Rukletsov
James—

Marco will correct me if I'm wrong, but my understanding is that this
change does *not* impact what ZooKeeper version you can use with Mesos. We
have changed the format of the message stored in ZK from protobuf to JSON.
This message is needed by frameworks for mesos master leader detection.
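
For framework authors who do their own leader detection, here is a minimal sketch of reading the JSON form. Several things are assumptions rather than facts from this thread: that the znodes are named `json.info_<sequence>`, that the lowest sequence number is the leader, and that the payload carries `address.ip` and `address.port`. Fetching the children from ZooKeeper is left out, and the sample data is made up.

```python
import json

JSON_PREFIX = "json.info_"  # assumed znode name prefix for the JSON format


def leader_address(znodes):
    """Return "ip:port" of the leading master.

    `znodes` maps child znode names to their raw byte contents, as a
    ZooKeeper client would return them for the Mesos znode path.
    """
    candidates = {name: data for name, data in znodes.items()
                  if name.startswith(JSON_PREFIX)}
    if not candidates:
        raise LookupError("no JSON MasterInfo znodes found")
    # Assumption: the contender with the lowest sequence number leads.
    leader = min(candidates, key=lambda name: int(name.rsplit("_", 1)[1]))
    info = json.loads(candidates[leader].decode("utf-8"))
    return "%s:%d" % (info["address"]["ip"], info["address"]["port"])


# Hypothetical znode contents for two masters.
sample = {
    "json.info_0000000007": b'{"address": {"ip": "10.0.0.2", "port": 5050}}',
    "json.info_0000000009": b'{"address": {"ip": "10.0.0.3", "port": 5050}}',
}
print(leader_address(sample))
```

Nothing here needs libmesos or a Protocol Buffers library.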

HTH,
Alex

On Fri, Sep 25, 2015 at 11:12 AM, CCAAT  wrote:

> On 09/25/2015 08:13 AM, Marco Massenzio wrote:
>
>> Folks:
>>
>> as a reminder, please be aware that as of Mesos 0.24.0, as announced
>> back in June, Mesos Master will write its information (`MasterInfo`) to
>> ZooKeeper in JSON format (see below for details).
>>
>
>
> What versions of Zookeeper are supported by this change? That is, what
> is the oldest version of Zookeeper known to work or not work with this
> change in Mesos?
>
>
> James
>
>
>
>
>
>> If your framework relied on parsing the info (either de-serializing the
>> Protocol Buffer or just looking for an "IP-like" string) this change
>> will be a breaking change.
>>
>> Just to confirm (see also Vinod's comments below) any rolling upgrades
>> (i.e., clusters with 0.22+0.23 and 0.23+0.24) of Mesos will just work.
>>
>> This was in conjunction with the HTTP API release and removing the need
>> for non-C++ developers to have to link with libmesos and have to deal
>> with Protocol Buffers.
>>
>> An example of how to access the new format in Python can be found in [0]
>> and we're happy to help with other languages too.
>> Any questions, please just ask.
>>
>> [0] http://github.com/massenz/zk-mesos
>>
>> Marco Massenzio
>> /Distributed Systems Engineer
>> http://codetrips.com/
>>
>> -- Forwarded message --
>> From: *Vinod Kone*
>> Date: Wed, Jun 24, 2015 at 4:17 PM
>> Subject: Re: [Breaking Change 0.24 & Upgrade path] ZooKeeper MasterInfo
>> change.
>> To: dev
>>
>>
>> Just to clarify, any frameworks that are using the Mesos provided bindings
>> (aka libmesos.so) should not worry, as long as the version of the bindings
>> and version of the mesos master are not separated by more than 1 version.
>> In other words, you should be able to live upgrade a cluster from 0.23.0
>> to 0.24.0.
>>
>> For framework schedulers that don't use the bindings (pesos, jesos etc),
>> it is prudent to add support for JSON formatted ZNODE to their master
>> detection code.
>>
>> Thanks,
>>
>> On Wed, Jun 24, 2015 at 4:10 PM, Marco Massenzio wrote:
>>
>>> Folks,
>>>
>>> as heads-up, we are planning to convert the format of the MasterInfo
>>> information stored in ZooKeeper from the Protocol Buffer binary format to
>>> JSON - this is in conjunction with the HTTP API development, to allow
>>> frameworks *not* to depend on libmesos and other binary dependencies to
>>> interact with Mesos Master nodes.
>>>
>>> *NOTE* - there is no change in 0.23 (so any Master/Slave/Framework that is
>>> currently working in 0.22 *will continue to work* in 0.23 too) but as of
>>> Mesos 0.24, frameworks and other clients relying on the binary format
>>> will break.
>>>
>>> The details of the design are in this Google Doc:
>>>
>>>
>>> https://docs.google.com/document/d/1i2pWJaIjnFYhuR-000NG-AC1rFKKrRh3Wn47Y2G6lRE/edit
>>>
>>> the actual work is detailed in MESOS-2340:
>>> https://issues.apache.org/jira/browse/MESOS-2340
>>>
>>> and the patch (and associated test) are here:
>>> https://reviews.apache.org/r/35571/
>>> https://reviews.apache.org/r/35815/
>>>
>>> *Marco Massenzio*
>>> *Distributed Systems Engineer*
>>>
>>
>>
>


Re: Reservations for multiple different agents

2015-09-22 Thread Alex Rukletsov
Rinaldo,

or you may try to install or port svn libs and check whether it works.

On Tue, Sep 22, 2015 at 2:25 AM, Guangya Liu  wrote:

> Hi Rinaldo,
>
> The dynamic reservation endpoint support was introduced in 0.25.0, you may
> want to use the latest code to build.
>
> If build fails on Oracle Linux, please go ahead to file a JIRA ticket to
> get some support.
>
> Thanks,
>
> Guangya
>
> On Tue, Sep 22, 2015 at 8:01 AM, DiGiorgio, Mr. Rinaldo S. <
> rdigior...@pace.edu> wrote:
>
>>
>> On Sep 21, 2015, at 19:33, Guangya Liu  wrote:
>>
>> HI Rinaldo,
>>
>> I think that you can use dynamic reservation feature to achieve this: You
>> can launch your tasks after reservation succeeds.  Actually, all of the
>> dynamic reservation feature with endpoint has been finished except ACL
>> part, so you can use this feature now if you do not care ACL part.
>>
>> Thanks,
>>
>>
>> Many thanks -- I am using 0.23. I am unable to compile 0.24 on Oracle
>> Linux. Do you think I should report the issue on Oracle Linux 7 -- the
>> subversion libraries are not being found.
>> Rinaldo
>>
>>
>> Guangya
>>
>> On Tue, Sep 22, 2015 at 6:32 AM, DiGiorgio, Mr. Rinaldo S. <
>> rdigior...@pace.edu> wrote:
>>
>>> Hi,
>>>
>>>I have some tasks that need to run on different types of agents. I
>>> don’t want the tasks to run unless I am going to have all the resources.
>>> Can someone suggest how I could accomplish that with mesos.  I read about
>>> reservations here:
>>> http://mesos.apache.org/documentation/latest/reservation/
>>>
>>>I could iterate over all the resources I need and if I get them
>>> proceed.
>>>
>>>Is that the only way to do it?
>>>
>>>Any idea when coming soon will be available?
>>>
>>> /reserve (*Coming Soon*)
>>>
>>> Suppose we want to reserve 8 CPUs and 4096 MB of RAM for the ads role
>>> on a slave with id=. We send an HTTP POST request to the
>>> /reserve HTTP endpoint like so:
>>>
>>>
>>> Rinaldo
>>>
>>>
>>>
>>
>>
>


Re: [VOTE] Release Apache Mesos 0.24.0 (rc2)

2015-09-05 Thread Alex Rukletsov
AFAIK, the Python test is flaky on OS X and should be fine on Ubuntu.
On 4 Sep 2015 10:48 pm, "Bernd Mathiske"  wrote:

> And also Ubuntu 13.10: [  FAILED  ] ExamplesTest.PythonFramework, known
> flaky test, so still +1
>
> On Sep 4, 2015, at 9:11 PM, Bernd Mathiske  wrote:
>
> +1 [binding]
>
> MacOS X (make check)
> CentOS 7 (make distcheck)
> Ubuntu 14.4 (make distcheck)
>
>
> On Sep 3, 2015, at 11:47 PM, Niklas Nielsen  wrote:
>
> +1 - tested on our CI
>
> On Tuesday, September 1, 2015, Vinod Kone  wrote:
>
>> Hi all,
>>
>>
>> Please vote on releasing the following candidate as Apache Mesos 0.24.0.
>>
>>
>> 0.24.0 includes the following:
>>
>>
>> 
>>
>> Experimental support for v1 scheduler HTTP API!
>>
>> This release also wraps up support for fetcher.
>>
>> The CHANGELOG for the release is available at:
>>
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.24.0-rc2
>>
>>
>> 
>>
>>
>> The candidate for Mesos 0.24.0 release is available at:
>>
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz
>>
>>
>> The tag to be voted on is 0.24.0-rc2:
>>
>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.24.0-rc2
>>
>>
>> The MD5 checksum of the tarball can be found at:
>>
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.md5
>>
>>
>> The signature of the tarball can be found at:
>>
>>
>> https://dist.apache.org/repos/dist/dev/mesos/0.24.0-rc2/mesos-0.24.0.tar.gz.asc
>>
>>
>> The PGP key used to sign the release is here:
>>
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>>
>> The JAR is up in Maven in a staging repository here:
>>
>> https://repository.apache.org/content/repositories/orgapachemesos-1066
>>
>>
>> Please vote on releasing this package as Apache Mesos 0.24.0!
>>
>>
>> The vote is open until Fri Sep  4 17:33:05 PDT 2015 and passes if a
>> majority of at least 3 +1 PMC votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Mesos 0.24.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> Thanks,
>>
>> Vinod
>>
>
>
>


Re: How does mesos determine how much memory on a node is available for offer?

2015-09-03 Thread Alex Rukletsov
Mesos agent (aka slave) estimates the memory available and advertises all
of it minus 1GB. If there is less than 2GB available, only half is
advertised [1].

[1]:
https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L98
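
The rule above can be written down as a small function. This is a sketch of the rule as described in this message, not a copy of the code behind [1]; the function name and the use of MB as the unit are assumptions.

```python
ONE_GB_MB = 1024


def advertised_mem_mb(available_mb):
    """Memory (MB) an agent would advertise under the rule described above."""
    if available_mb < 2 * ONE_GB_MB:
        # Less than 2GB available: advertise only half of it.
        return available_mb // 2
    # Otherwise leave 1GB to the system and advertise the rest.
    return available_mb - ONE_GB_MB


for total in (2004, 2048, 2560, 8192):
    print(total, "MB available ->", advertised_mem_mb(total), "MB advertised")
```

The 2560 MB case gives 1536 MB, which matches the 2.5GB VM showing 1.5GB in the quoted report below, and a machine with about 2004 MB visible would show the 1002 MB figure mentioned there.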

On Thu, Sep 3, 2015 at 4:01 AM, Anand Mazumdar  wrote:

> My bad, Seeing the 1002mb(~1024) number made me think the agent was not
> able to get the memory estimates from the OS and defaulting to the constant
> values.
>
> The slave executes a `sysinfo` system call and populates the memory
> numbers based on it. If you want more fine-grained control, try to
> specify it directly using the --resources flag as I had mentioned earlier.
>
> -anand
>
> On Sep 2, 2015, at 6:48 PM, F21  wrote:
>
> There seems to be some dynamicness to it. I just bumped the memory for
> each VM up to 2.5GB and now mesos is offering 1.5GB on it's slave. Is there
> some percentage value that I can set so that more memory is available to
> mesos?
>
> On 3/09/2015 11:23 AM, Anand Mazumdar wrote:
>
> In case you don’t specify the resources via the "--resources" flag when you
> start your agent, it picks up the default values. (Example:
> --resources="cpus:4;mem:1024;disk:2")
>
> The default value for memory is here:
> 
> https://github.com/apache/mesos/blob/master/src/slave/constants.cpp#L46
>
> -anand
>
>
>


Re: mesos-master resource offer details

2015-09-02 Thread Alex Rukletsov
If my understanding of how the Mesos allocation algorithm works is correct,
allocations are made whenever offers are made. An allocator performs
allocation, which is used by the master to generate offers to frameworks,
which, in turn, may be accepted or declined. Have you tried to increase the
log level for the master as suggested?

To help you with your problem, could you please describe the setup you use?
Specifically, how "fat" your agents (aka slaves) are, what is the task
description you send to marathon, what are the available resources in the
cluster (state.json).

On Wed, Sep 2, 2015 at 7:02 PM, Haripriya Ayyalasomayajula <
aharipriy...@gmail.com> wrote:

> Alex,
>
> The problem I am facing is that there are no allocations made.  Mesos
> -master gives 5 requests to marathon. But marathon DECLINE s all the
> offers. I am trying to debug the reason why it is rejecting the offers. I
> traced down the source code to see that it calls the ResourceMatcher to
> match the resource offered vs. Resource Available and in my case it says it
> has problem with the cpu's offered (not sufficient resources ). I am trying
> to get the details of the resource offer made available - the cpu's being
> offered and I'm stuck there..
>
> I really appreciate if you have any suggestions! Thanks.
>
> On Wed, Sep 2, 2015 at 9:54 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> To what Haosdent said: you cannot get a list of offers from master logs,
>> but you can get a list of allocations from the built-in allocator if you
>> bump up the log level (GLOG_v=2).
>>
>> On Wed, Sep 2, 2015 at 7:36 AM, haosdent <haosd...@gmail.com> wrote:
>>
>>> If the offer is rejected by your framework, could you find this log in
>>> mesos:
>>>
>>> ```
>>> xxx Processing DECLINE call for offers xxx
>>> ```
>>>
>>> On Wed, Sep 2, 2015 at 1:31 PM, haosdent <haosd...@gmail.com> wrote:
>>>
>>>> >Well, the log you mentioned above is when the resource offer is
>>>> accepted and mesos-master then allocates the cpu.
>>>> Hi, @Haripriya As far as i know, the log I show above is allocator
>>>> allocate resource and make a offer. And then trigger Master::offer to send
>>>> offer to frameworks. So the log above is not resource offer is
>>>> accepted, it is before send offer to framework and it also is the details
>>>> about that offer.
>>>>
>>>> For you problem
>>>> >In my case, the offer is being rejected
>>>> If you mean the offer is rejected by your framework after your
>>>> framework receive it? Or you mean your framework never receive offers from
>>>> mesos?
>>>>
>>>>
>>>> On Wed, Sep 2, 2015 at 1:51 AM, Haripriya Ayyalasomayajula <
>>>> aharipriy...@gmail.com> wrote:
>>>>
>>>>> Well, the log you mentioned above is when the resource offer is
>>>>> accepted and mesos-master then allocates the cpu. In my case, the offer is
>>>>> being rejected. I am trying to debug the reason as to why the resource
>>>>> offer is being rejected.
>>>>>
>>>>> On Tue, Sep 1, 2015 at 10:00 AM, haosdent <haosd...@gmail.com> wrote:
>>>>>
>>>>>> Yes, currently the mesos code only prints the number of offers at the
>>>>>> default log level. If you want more details, you could start by setting
>>>>>> the environment variable GLOG_v=2. Then you should get a message similar
>>>>>> to this:
>>>>>>
>>>>>> I0902 00:55:17.465920 143396864 hierarchical.hpp:935] Allocating
>>>>>> cpus(*):x; mem(*):x; disk(*):x; ports(*):[x-x] on slave
>>>>>> 20150902-005512-16777343-5050-46447-S0 to framework
>>>>>> 20150902-005512-16777343-5050-46447-
>>>>>>
>>>>>> But using GLOG_v=2 produces a lot of log output. If you just want the
>>>>>> resources allocated to a task or executor, you could get that information
>>>>>> from the slave state.json endpoint.
>>>>>>
>>>>>> On Wed, Sep 2, 2015 at 12:41 AM, Haripriya Ayyalasomayajula <
>>>>>> aharipriy...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks, but is there no way without tweaking the source code of the
>>>>>>> framework scheduler that I get the details of the resource offer? I don't
>>>>>>> see anythin

Re: mesos-master resource offer details

2015-09-02 Thread Alex Rukletsov
To what Haosdent said: you cannot get a list of offers from master logs,
but you can get a list of allocations from the built-in allocator if you
bump up the log level (GLOG_v=2).
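
Once that logging is on, the allocation lines can be pulled apart with a few lines of code. A minimal sketch follows. It assumes the "Allocating ... on slave ... to framework ..." wording shown in the quoted message below; the resource values, slave id and framework id in the sample are placeholders, since the quoted line elides them.

```python
import re

# Assumed shape of the allocator's verbose log line.
ALLOCATION = re.compile(
    r"Allocating (?P<resources>.+) on slave (?P<slave>\S+) "
    r"to framework (?P<framework>\S+)"
)


def parse_allocation(line):
    """Return slave, framework and a name -> value dict, or None if no match."""
    match = ALLOCATION.search(line)
    if match is None:
        return None
    resources = {}
    for item in match.group("resources").split(";"):
        name, _, value = item.strip().partition(":")
        resources[name] = value
    return {
        "slave": match.group("slave"),
        "framework": match.group("framework"),
        "resources": resources,
    }


# Placeholder values throughout; only the overall shape follows the quote.
sample = ("I0902 00:55:17.465920 143396864 hierarchical.hpp:935] Allocating "
          "cpus(*):2; mem(*):1024; disk(*):4096; ports(*):[31000-32000] "
          "on slave S0 to framework F1")
print(parse_allocation(sample))
```

Comparing the `cpus(*)` value of each allocation with what the task asks for is a quick way to see why a framework keeps declining.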

On Wed, Sep 2, 2015 at 7:36 AM, haosdent  wrote:

> If the offer is rejected by your framework, could you find this log in
> mesos:
>
> ```
> xxx Processing DECLINE call for offers xxx
> ```
>
> On Wed, Sep 2, 2015 at 1:31 PM, haosdent  wrote:
>
>> >Well, the log you mentioned above is when the resource offer is
>> accepted and mesos-master then allocates the cpu.
>> Hi, @Haripriya As far as i know, the log I show above is allocator
>> allocate resource and make a offer. And then trigger Master::offer to send
>> offer to frameworks. So the log above is not resource offer is accepted,
>> it is before send offer to framework and it also is the details about that
>> offer.
>>
>> For you problem
>> >In my case, the offer is being rejected
>> If you mean the offer is rejected by your framework after your framework
>> receive it? Or you mean your framework never receive offers from mesos?
>>
>>
>> On Wed, Sep 2, 2015 at 1:51 AM, Haripriya Ayyalasomayajula <
>> aharipriy...@gmail.com> wrote:
>>
>>> Well, the log you mentioned above is when the resource offer is accepted
>>> and mesos-master then allocates the cpu. In my case, the offer is being
>>> rejected. I am trying to debug the reason as to why the resource offer is
>>> being rejected.
>>>
>>> On Tue, Sep 1, 2015 at 10:00 AM, haosdent  wrote:
>>>
 Yes, currently only print number for offers in mesos code in default
 log level. If you want get more details about it, you could start with set
 environment variable GLOG_v2=1 Then you should got some similar message
 like this:

 I0902 00:55:17.465920 143396864 hierarchical.hpp:935] Allocating
 cpus(*):x; mem(*):x; disk(*):x; ports(*):[x-x] on slave
 20150902-005512-16777343-5050-46447-S0 to framework 20150902-00551
 2-16777343-5050-46447-

 But use GLOG_v2 would have a lot of log. If you just want to get the
 resources allocated to task or executor, you could get those informations
 from slave state.json endpoint.

 On Wed, Sep 2, 2015 at 12:41 AM, Haripriya Ayyalasomayajula <
 aharipriy...@gmail.com> wrote:

> Thanks, but is there no way without tweaking the source code of the
> framework scheduler that I get the details of the resource offer? I don't
> see anything in my logs.
>
> All I can see is
>
> mesos-master: Sending 5 offers to framework 20150815- (marathon)
> at scheduler-50ajaja@pqr
>
> I can't find any other details in the logs..
>
> On Mon, Aug 31, 2015 at 8:36 PM, haosdent  wrote:
>
>> Hi, Haripriya.
>>
>> >1. I am trying to see the details of the resource offer made by the
>> mesos master. I can see in the logs that there are 5 resource offers made
>> but I am not sure where to get the details of the resource offers - the
>> cpu, memory etc.
>>
>> You could print offer details in your framework's
>> Scheduler#resourceOffers method. These offer messages can also be
>> found in the Mesos log.
>>
>> >2. How can I list the number of slaves registered with the master
>> and the details of the slaves on the command line (apart from seeing it in
>> the UI)?
>>
>> We have some endpoints (state.json and state-summary) in the master and
>> slave that expose this information; you could get it with
>>
>> ```
>> curl -s "http://localhost:5050/master/state-summary" | jq .slaves
>> ```
>>
>>
>> On Tue, Sep 1, 2015 at 6:47 AM, Haripriya Ayyalasomayajula <
>> aharipriy...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I'm having trouble with some basic details:
>>>
>>> 1. I am trying to see the details of the resource offer made by the
>>> mesos master. I can see in the logs that there are 5 resource offers made
>>> but I am not sure where to get the details of the resource offers - the
>>> cpu, memory etc.
>>>
>>> 2. How can I list the number of slaves registered with the master
>>> and the details of the slaves on the command line (apart from seeing it in
>>> the UI)?
>>>
>>> Thanks for the help.
>>>
>>> --
>>> Regards,
>>> Haripriya Ayyalasomayajula
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Regards,
> Haripriya Ayyalasomayajula
>
>


 --
 Best Regards,
 Haosdent Huang

>>>
>>>
>>>
>>> --
>>> Regards,
>>> Haripriya Ayyalasomayajula
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>
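
To go with the state-summary suggestion quoted above, here is a minimal
sketch of pulling per-slave details out of the JSON body once it has been
fetched (for example with the curl command in the thread). The sample body
and its field names (hostname, resources, cpus, mem) are assumptions for
illustration; real responses carry more fields and may differ between Mesos
versions.

```python
import json

# Hypothetical, trimmed response body used only for illustration.
SAMPLE_STATE_SUMMARY = """
{
  "slaves": [
    {"id": "S0", "hostname": "node-a", "resources": {"cpus": 4, "mem": 8192}},
    {"id": "S1", "hostname": "node-b", "resources": {"cpus": 8, "mem": 16384}}
  ]
}
"""


def summarize_slaves(body):
    """Return a (hostname, cpus, mem) tuple for every slave in the JSON body."""
    state = json.loads(body)
    rows = []
    for slave in state.get("slaves", []):
        resources = slave.get("resources", {})
        rows.append(
            (slave.get("hostname"), resources.get("cpus"), resources.get("mem"))
        )
    return rows


if __name__ == "__main__":
    for hostname, cpus, mem in summarize_slaves(SAMPLE_STATE_SUMMARY):
        print(f"{hostname}: cpus={cpus} mem={mem}")
```

Missing keys come back as None rather than raising, which keeps the sketch
tolerant of fields that a given version does not report.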


Re: Use docker start rather than docker run?

2015-08-28 Thread Alex Rukletsov
Heh, that's a tricky one : ). A framework indeed consists of a scheduler
and an executor; both are mandatory. But Mesos provides a default
general-purpose executor, which frameworks can use. This executor
has many names; the two most common are MesosExecutor and CommandExecutor.
Marathon doesn't have its own executor yet (in contrast to, say, Aurora);
it uses CommandExecutor for all of its tasks.

CommandExecutor is implicitly created by Mesos if a task specification does
not include an executor. This executor can have just a single task and is
garbage collected after the task finishes. A task is any command, which
will be executed via '/bin/sh -c command'.
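
To make the '/bin/sh -c command' part concrete, here is a rough sketch of
what running a task's command string through the shell looks like. This is
not Mesos code; run_task_command is a hypothetical helper, and the real
executor also handles sandboxing, status updates and so on.

```python
import subprocess


def run_task_command(command):
    """Run a command string via /bin/sh -c; return (exit_code, stdout)."""
    completed = subprocess.run(
        ["/bin/sh", "-c", command],
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # The whole string is interpreted by the shell, so operators such as
    # '&&' work and the shell's exit code becomes the task's outcome.
    code, out = run_task_command("echo hello && exit 3")
    print(code, out.strip())
```

Because the command is a single shell string, anything the shell accepts
(pipes, redirects, environment expansion) can be a task.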


On Fri, Aug 28, 2015 at 8:50 PM, Tim Chen t...@mesosphere.io wrote:

 We have primitives for persistent volumes in the next release (0.25.0), but
 DockerContainerizer integration will most likely happen in the version after.

 Tim

 On Fri, Aug 28, 2015 at 11:50 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Paul,

 Alternatively, you can try to launch your task on the same host by
 specifying a constraint with Marathon and mounting a directory on the host in
 your container every time as a workaround.

 Tim

 On Fri, Aug 28, 2015 at 11:44 AM, Paul Bell arach...@gmail.com wrote:

 Alex & Tim,

 Thank you both; most helpful.

 Alex, can you dispel my confusion on this point: I keep reading that a
 framework in Mesos (e.g., Marathon) consists of a scheduler and an
 executor. This reference to executor made me think that Marathon must
 have *some* kind of presence on the slave node. But the more familiar I
 become with Mesos the less likely this seems to me. So, what does it mean
 to talk about the Marathon framework executor?

 Tim, I did come up with a simple work-around that involves re-copying
 the needed file into the container each time the application is started.
 For reasons unknown, this file is not kept in a location that would readily
 lend itself to my use of persistent storage (Docker -v). That said, I am
 keenly interested in learning how to write both custom executors &
 schedulers. Any sense for what release of Mesos will see persistent
 volumes?

 Thanks again, gents.

 -Paul



 On Fri, Aug 28, 2015 at 2:26 PM, Tim Chen t...@mesosphere.io wrote:

 Hi Paul,

 We don't [re]start a container since we assume once the task terminated
 the container is no longer reused. In Mesos to allow tasks to reuse the
 same executor and handle task logic accordingly people will opt to choose
 the custom executor route.

 We're working on a way to keep your sandbox data beyond a container
 lifecycle, which is called persistent volumes. We haven't integrated that
 with Docker containerizer yet, so you'll have to wait to use that feature.

 You could also choose to implement a custom executor for now if you
 like.

 Tim

 On Fri, Aug 28, 2015 at 10:43 AM, Alex Rukletsov a...@mesosphere.com
 wrote:

 Paul,

 that component is called DockerContainerizer and it's part of Mesos
 Agent (check
 /Users/alex/Projects/mesos/src/slave/containerizer/docker.hpp). @Tim,
 could you answer the docker start vs. docker run question?

 On Fri, Aug 28, 2015 at 1:26 PM, Paul Bell arach...@gmail.com wrote:

 Hi All,

 I first posted this to the Marathon list, but someone suggested I try
 it here.

 I'm still not sure what component (mesos-master, mesos-slave,
 marathon) generates the docker run command that launches containers on a
 slave node. I suppose that it's the framework executor (Marathon) on the
 slave that actually executes the docker run, but I'm not sure.

 What I'm really after is whether or not we can cause the use of
 docker start rather than docker run.

 At issue here is some persistent data inside
 /var/lib/docker/aufs/mnt/CTR_ID. docker run will by design (re)launch
 my application with a different CTR_ID effectively rendering that data
 inaccessible. But docker start will restart the container and its old
 data will still be there.

 Thanks.

 -Paul









Re: Are the resource options documented?

2015-08-25 Thread Alex Rukletsov
From Mesos' point of view, a resource is just a string; your agents may
advertise gpu, bananas, pandas and so on. However, some resources are
known to Mesos, and for them isolation is possible. A good example is the
cgroups isolator for mem resources, which will invoke the OOM killer if
necessary. Compare with GPU resources: if your agent advertises, say, 1GB of
gpu to the master, a task may accept 100MB, but the agent will have no
control over whether the task uses no more than 100MB, because there is no
isolator for this resource. The good news is that you can write an isolator
for your resource, wrap it into a Mesos module, and let the Mesos agent use it!

P.S. cpu is not a known resource, but cpus is.
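
A small sketch of the "a resource is just a string" point: parsing a
--resources style value of the form name:value;name:value and flagging the
names that fall outside a set of known resources. The KNOWN_RESOURCES set
below is an assumption for illustration (check the documentation for your
version), and the parser ignores role annotations such as cpus(role1).

```python
# Assumed set of resource names Mesos knows about; illustrative only.
KNOWN_RESOURCES = {"cpus", "mem", "disk", "ports"}


def parse_resources(spec):
    """Parse 'name:value;name:value' into a dict.

    Scalars become floats; anything else (ranges, sets) is kept as raw text.
    """
    resources = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition(":")
        name, value = name.strip(), value.strip()
        try:
            resources[name] = float(value)
        except ValueError:
            resources[name] = value
    return resources


def unknown_resources(resources):
    """Names that are advertised but not in the assumed known set."""
    return sorted(name for name in resources if name not in KNOWN_RESOURCES)


if __name__ == "__main__":
    parsed = parse_resources("cpus:4;mem:1024;bananas:7;ports:[31000-32000]")
    print(parsed)
    print("no isolation expected for:", unknown_resources(parsed))
```

Custom names such as bananas are accepted and offered like any other
resource; they simply have no isolator enforcing them unless you provide one.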

On Tue, Aug 25, 2015 at 7:31 PM, craig w codecr...@gmail.com wrote:

 When configuring a mesos-slave with --resources, I know cpu, mem and
 ports are available. Are there others? Are these documented somewhere?

 I've found some examples here
 https://open.mesosphere.com/reference/mesos-slave/ and the configuration
 page (http://mesos.apache.org/documentation/latest/configuration/) is
 generic with its description of --resources.

 Thanks
 craig



Re: Custom Scheduler: Diagnosing cause of container task failures

2015-08-25 Thread Alex Rukletsov
It looks like we can have a better error message here.

@Jay, mind filing a JIRA ticket for this, with a description, the status
update, and your fix attached? Thanks!

On Fri, Aug 21, 2015 at 7:36 PM, Jay Taylor j...@jaytaylor.com wrote:

 Eventually I was able to isolate what was going on; in this case the
 FrameworkInfo.User was set to an invalid value and setting it to root did
 the trick.

 My scheduler is now working [in a basic form]!!!

 Cheers,
 Jay

 On Thu, Aug 20, 2015 at 4:15 PM, Jay Taylor j...@jaytaylor.com wrote:

 Hey Tim,

 Thank you for the quick response!

 Just checked the sandbox logs and they are all empty (stdout and stderr
 are both 0 bytes).

 I have discovered a little bit more information from the StatusUpdate
 event posted back to my scheduler:

 TaskStatus{
 TaskId: TaskID{
 Value:*fluxCapacitor-test-1,XXX_unrecognized:[],
 },
 State: *TASK_FAILED,
 Message: *Abnormal executor termination,
 Source: *SOURCE_SLAVE,
 Reason: *REASON_COMMAND_EXECUTOR_FAILED,
 Data:nil,
 SlaveId: SlaveID{
 Value: *20150804-211459-1407297728-5050-5855-S1,
 XXX_unrecognized: [],
 },
 ExecutorId: nil,
 Timestamp: *1.440112075509318e+09,
 Uuid: *[102 75 82 85 38 139 68 94 153 189 210 87 218 235 147 166],
 Healthy: nil,
 XXX_unrecognized: [],
 }

 How can I find out why the command executor is failing?


 On Thu, Aug 20, 2015 at 4:08 PM, Tim Chen t...@mesosphere.io wrote:

 It received a TASK_FAILED from the executor, so you'll need to look at
 the sandbox logs of your task stdout and stderr files to see what went
 wrong.

 These files should be reachable by the Mesos UI.

 Tim

 On Thu, Aug 20, 2015 at 4:01 PM, Jay Taylor outtat...@gmail.com wrote:

 Hey everyone,

 I am writing a scheduler for Mesos and one of my first goals is to get
 a simple Docker container to run.

 The tasks get marked as failed with the failure messages originating
 from the slave logs.  Now I'm not sure how to determine exactly what is
 causing the failure.

 The most informative log messages I've found were in the slave log:

 == /var/log/mesos/mesos-slave.INFO ==
 W0820 20:44:25.242230 29639 docker.cpp:994] Ignoring updating unknown
 container: e190037a-b011-4681-9e10-dcbacf6cb819
 I0820 20:44:25.242270 29639 status_update_manager.cpp:322] Received
 status update TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for
 task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.242377 29639 slave.cpp:2961] Forwarding the update
 TASK_FAILED (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60) for task
 jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060 to
 master@63.198.215.105:5050
 I0820 20:44:25.247926 29636 status_update_manager.cpp:394] Received
 status update acknowledgement (UUID: 17a21cf7-17d1-42dd-92eb-b281396ebf60)
 for task jay-test-29 of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248108 29636 slave.cpp:3502] Cleaning up executor
 'jay-test-29' of framework 20150804-211741-1608624320-5050-18273-0060
 I0820 20:44:25.248342 29636 slave.cpp:3591] Cleaning up framework
 20150804-211741-1608624320-5050-18273-0060

 And this doesn't really tell me much about *why* it's failed.

 Is there somewhere else I should be looking or an option that needs to
 be turned on to show more information?

 Your assistance is greatly appreciated!

 Jay







Re: Launching tasks with reserved resources

2015-08-17 Thread Alex Rukletsov
Hi Gidon,

just to make sure, you mean static reservations on mesos agents (via
--resources flag) and not dynamic reservations, right?

Let me first try to explain why you get the TASK_ERROR message. The
built-in allocator merges '*' and reserved resources, hinting the master to
create a single offer. However, as you mentioned before, validation fails
if you try to mix resources with different roles, because the function
responsible for validation checks whether task resources are contained in
the offered resources, which includes a role equality check. Here are
some source code snippets:
https://github.com/apache/mesos/blob/master/src/master/validation.cpp#L449
https://github.com/apache/mesos/blob/master/src/common/resources.cpp#L598
https://github.com/apache/mesos/blob/master/src/common/resources.cpp#L244
https://github.com/apache/mesos/blob/master/src/common/resources.cpp#L197

Maybe we should split reserved and unreserved resources into two offers?

Now, to your second concern about whether we should disallow tasks using
both '*' and 'role' resources. I see your point: if a framework is entitled
to use reserved and unreserved resources, why not hoard them and launch a
bigger task? I think it's fine, and you should be actually able to do it by
explicitly specifying two different resource objects in the task launch
message, one for '*' resources and one for your role. Why can't you just
use your framework's role for both? Different roles may have different
guarantees (quota, MESOS-1791), and while reserved resources may still be
available for your framework, '*' may become unavailable for you (in future
Mesos releases or with custom allocators) leading to the whole task
termination. By requiring two different objects in the task launch message
we motivate the framework — i.e. framework writer — to be aware of
different policies that may be attached to different roles. Does it make
sense?

—Alex
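
A toy model of the containment check described above, to show why a task
that asks for everything under one role is rejected while a task that
splits its request across '*' and the role passes. This is a simplified
sketch, not the code behind the links; resources are modelled as a plain
dict keyed by (name, role), and contains is a hypothetical helper.

```python
def contains(offered, requested):
    """True if every (name, role) amount in `requested` fits into `offered`.

    The role is part of the key, so 'cpus' under '*' and 'cpus' under
    'role1' are compared separately, never as a combined total.
    """
    return all(
        offered.get(key, 0.0) >= amount for key, amount in requested.items()
    )


# An offer with 2 unreserved cpus and 2 cpus reserved for role1.
offer = {("cpus", "*"): 2.0, ("cpus", "role1"): 2.0}

# 3 cpus all marked '*': more than the 2 unreserved cpus on offer.
all_unreserved = {("cpus", "*"): 3.0}

# The same 3 cpus expressed as two resource objects, one per role.
split_by_role = {("cpus", "*"): 2.0, ("cpus", "role1"): 1.0}

if __name__ == "__main__":
    print(contains(offer, all_unreserved))
    print(contains(offer, split_by_role))
```

The offer holds 4 cpus in total, yet the first request fails because only
the matching role's share is considered.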

On Thu, Aug 13, 2015 at 2:23 PM, Gidon Gershinsky gi...@il.ibm.com wrote:

 I have a simple setup where a framework runs with a role, and some
 resources are reserved in cluster for that role.
 The resource offers arrive at the framework as a list of two resource
 sets: one general (cpus(*), etc.) and one specific for the role
 (cpus(role1), etc.).

 So far so good. If two tasks are launched, each with one of the two
 resources, things work.

 But problems start when I need to launch multiple smaller tasks (with a
 total resource consumption equal to the offered). I run this by creating
 resource objects, and attaching them to tasks, using calls from the
 standard Mesos samples (python):
 task = mesos_pb2.TaskInfo()
 cpus = task.resources.add()
 cpus.name = "cpus"
 cpus.type = mesos_pb2.Value.SCALAR
 cpus.scalar.value = TASK_CPUS

 checking that the total doesn't surpass the offered resources. This starts
 fine, but soon I get TASK_ERROR messages, due to the Master validator finding
 that more resources are requested by tasks than are available in the offer.
 This obviously happens because all task resources, as defined above, come
 with the (*) role, while the offer resources are split between * and role1!
 Ok, then I assign a role to task resources, by adding
 cpus.role = "role1"

 But this fails again, and for the same reason..

 Shouldn't this work differently? When a resource offer is received by a
 framework with role1, why should it care which part is 'unreserved'
 and which part is reserved to role1? When a task launch request is
 received by the master, from a framework with a role, why can't it check
 only the total resource amount, instead of treating unreserved and reserved
 resources separately? They are reserved for this role anyway.. Or I'm
 missing something?


 Regards,
 Gidon






Re: Launching tasks with reserved resources

2015-08-17 Thread Alex Rukletsov
> if there were an api for splitting a resource object

I think it's a good idea; resource math is something that each framework
re-implements. We were discussing the idea of providing a framework kit,
but AFAIK no work has been done in this direction yet. Mind filing a
JIRA ticket?

> sending the reserved and unreserved resources in two separate offers
> indeed helps here

I would say this one also deserves a ticket. There may be use cases where
this is undesirable that I don't see, but I will be happy to see the
discussion around that documented in the ticket. Even if the ticket ends up
as "won't fix", the discussion and reasoning can be helpful for posterity.
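
For the ticket, a rough sketch of what such a splitting helper might look
like on the framework side. Everything here is hypothetical (the carve name,
the dict-of-(name, role) model, the choice to consume reserved resources
before '*'); it only illustrates the role bookkeeping that frameworks
currently do by hand.

```python
def carve(remaining, name, amount, role_order=("role1", "*")):
    """Take `amount` of resource `name` out of `remaining`.

    Roles are tried in `role_order`. Returns a list of (name, role, amount)
    pieces to attach to the task, or None if the offer cannot cover the
    request. `remaining` is only modified when the request succeeds.
    """
    pieces = []
    needed = amount
    for role in role_order:
        take = min(remaining.get((name, role), 0.0), needed)
        if take > 0:
            pieces.append((name, role, take))
            needed -= take
        if needed <= 0:
            break
    if needed > 0:
        return None
    for piece_name, role, take in pieces:
        remaining[(piece_name, role)] -= take
    return pieces


if __name__ == "__main__":
    # One offer: 2 cpus reserved for role1 and 2 unreserved cpus.
    remaining = {("cpus", "role1"): 2.0, ("cpus", "*"): 2.0}
    for task_number in range(3):
        print(task_number, carve(remaining, "cpus", 1.5))
    print(remaining)
```

With 1.5 cpus per task, the first task fits entirely in the reserved share,
the second straddles both roles and therefore needs two resource objects,
and the third does not fit.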

On Mon, Aug 17, 2015 at 3:46 PM, Gidon Gershinsky gi...@il.ibm.com wrote:

 Hi Alex,

 Yep, this setup is using static reservations in agents.

 I haven't tried running a big task with two or more resources (reserved
 and unreserved), but guess it is quite intuitive for a developer - a
 framework is offered two resource objects, and launches a task specifying
 these objects, no need to dive too deep into resource roles etc. If a
 framework hoards resources, it can sum up the offered objects, which
 again looks reasonable.
 The problem I had is at the opposite end - when a framework needs to split
 the offered resources and run many smaller tasks. Eventually, I was able to
 bypass it, by micro-managing the role assignment to each task resources;
 cumbersome, but works. So its more of a usage issue - if there were an api
 for splitting a resource object (opposite to the + api for
 summing/hoarding), the things would be more intuitive.
 Btw, sending the reserved and unreserved resources in two separate offers
 indeed helps here, since each offer comes with a single role.
 In any case, I agree it makes sense for a developer to be aware of the
 reservation policies.



 Regards,
 Gidon







 From: Alex Rukletsov a...@mesosphere.com
 To: user@mesos.apache.org
 Date: 17/08/2015 01:02 PM
 Subject: Re: Launching tasks with reserved resources
 --



 Hi Gidon,

 just to make sure, you mean static reservations on mesos agents (via
 --resources flag) and not dynamic reservations, right?

 Let me first try to explain, why you get the TASK_ERROR message. The
 built-in allocator merges '*' and reserved resources, hinting master to
 create a single offer. However, as you mentioned before, validation fails,
 if you try to mix resources with different role, because the function
 responsible for validation checks whether task resources are contained in
 offered resources, which obviously includes role equality check. Here are
 some source code snippets:

 https://github.com/apache/mesos/blob/master/src/master/validation.cpp#L449
 https://github.com/apache/mesos/blob/master/src/common/resources.cpp#L598
 https://github.com/apache/mesos/blob/master/src/common/resources.cpp#L244
 https://github.com/apache/mesos/blob/master/src/common/resources.cpp#L197

 Maybe we should split reserved and unreserved resources into two offers?

 Now, to your second concern about whether we should disallow tasks using
 both '*' and 'role' resources. I see your point: if a framework is entitled
 to use reserved and unreserved resources, why not hoard them and launch a
 bigger task? I think it's fine, and you should be actually able to do it by
 explicitly specifying two different resource objects in the task launch
 message, one for '*' resources and one for your role. Why can't you just
 use your framework's role for both? Different roles may have different
 guarantees (quota, MESOS-1791), and while reserved resources may still be
 available for your framework, '*' may become unavailable for you (in future
 Mesos releases or with custom allocators) leading to the whole task
 termination. By requiring two different objects in the task launch message
 we motivate the framework — i.e. framework writer — to be aware of
 different policies that may be attached to different roles. Does it make
 sense?

 —Alex

 On Thu, Aug 13, 2015 at 2:23 PM, Gidon Gershinsky gi...@il.ibm.com wrote:
 I have a simple setup where a framework runs with a role, and some
 resources are reserved in cluster for that role.
 The resource offers arrive at the framework as a list of two resource
 sets: one general (cpus(*)), etc)  and one specific for the role
 (cpus(role1), etc).

 So far so good. If two tasks are launched, each with one of the two
 resources, things work.

 But problems start when I need to launch multiple smaller tasks (with a
 total resource consumption equal to the offered). I run this by creating
 resource objects, and attaching

Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)

2015-07-06 Thread Alex Rukletsov
-1

Compilation error on Mac OS 10.10.4 with clang 3.5, which is supported
according to release notes.
More details: https://issues.apache.org/jira/browse/MESOS-2991

On Mon, Jul 6, 2015 at 11:55 AM, Jörg Schad jo...@mesosphere.io wrote:

 P.S. to my prior +1
 Tested on ubuntu-trusty-14.04 including docker.

 On Sun, Jul 5, 2015 at 6:44 PM, Jörg Schad jo...@mesosphere.io wrote:

 +1

 On Sun, Jul 5, 2015 at 4:36 PM, Nikolaos Ballas neXus 
 nikolaos.bal...@nexusgroup.com wrote:

  +1



  Sent from my Samsung device


  Original message 
 From: tommy xiao xia...@gmail.com
 Date: 05/07/2015 15:14 (GMT+01:00)
 To: user@mesos.apache.org
 Subject: Re: [VOTE] Release Apache Mesos 0.23.0 (rc1)

  +1

 2015-07-04 12:32 GMT+08:00 Weitao zhouwtl...@gmail.com:

  +1

 Sent from my iPhone

 On July 4, 2015, at 09:41, Marco Massenzio ma...@mesosphere.io wrote:

   +1

  *Marco Massenzio*
 *Distributed Systems Engineer*

 On Fri, Jul 3, 2015 at 12:25 PM, Adam Bordelon a...@mesosphere.io
 wrote:

 Hello Mesos community,

 Please vote on releasing the following candidate as Apache Mesos
 0.23.0.

 0.23.0 includes the following:

 
  - Per-container network isolation
 - Upgraded minimum required compilers to GCC 4.8+ or clang 3.5+.
 - Dockerized slaves will properly recover Docker containers upon
 failover.

 as well as experimental support for:
  - Fetcher Caching
  - Revocable Resources
  - SSL encryption
  - Persistent Volumes
  - Dynamic Reservations

 The CHANGELOG for the release is available at:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.23.0-rc1

 

 The candidate for Mesos 0.23.0 release is available at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz

 The tag to be voted on is 0.23.0-rc1:

 https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.23.0-rc1

 The MD5 checksum of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.md5

 The signature of the tarball can be found at:

 https://dist.apache.org/repos/dist/dev/mesos/0.23.0-rc1/mesos-0.23.0.tar.gz.asc

 The PGP key used to sign the release is here:
 https://dist.apache.org/repos/dist/release/mesos/KEYS

 The JAR is up in Maven in a staging repository here:
 https://repository.apache.org/content/repositories/orgapachemesos-1056

 Please vote on releasing this package as Apache Mesos 0.23.0!

 The vote is open until Fri July 10th, 12:00 PDT 2015 and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Mesos 0.23.0
 [ ] -1 Do not release this package because ...

 Thanks,
  -Adam-





  --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com






Re: When do executors shutdown?

2015-06-30 Thread Alex Rukletsov
An executor is terminated by Mesos if it misbehaves (e.g. sends
TASK_STAGING updates or uses too much memory), if it is killed by an
oversubscription QoSController, if its framework shuts down, or if a
scheduler sends a scheduler::Call::Shutdown request to Mesos. Note that an
executor may also fail or decide to commit suicide.
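
A toy model of the dispatch rule quoted further down in this thread (reuse
an executor when the executorID already exists on the slave, otherwise
create one; give each command task its own executor). ToyAgent is purely
illustrative and not how the slave is implemented; in particular, keying a
command task's executor by the task ID is an assumption of this sketch.

```python
class ToyAgent:
    """Tracks which tasks were handed to which executor on one slave."""

    def __init__(self):
        self.executors = {}  # executor_id -> list of task ids

    def launch(self, task_id, executor_id=None):
        """Dispatch a task; return (executor_id, created_new_executor)."""
        if executor_id is None:
            # Command task: a dedicated executor, keyed by task id here.
            executor_id = task_id
        created = executor_id not in self.executors
        self.executors.setdefault(executor_id, []).append(task_id)
        return executor_id, created


if __name__ == "__main__":
    agent = ToyAgent()
    print(agent.launch("t1", "exec-A"))  # first use creates the executor
    print(agent.launch("t2", "exec-A"))  # same executorID: reused
    print(agent.launch("t3"))            # command task: its own executor
```

Two batches of tasks that name the same executorID on the same slave end up
in one executor, which answers the original question in the thread.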

On Tue, Jun 30, 2015 at 12:38 PM, Hans van den Bogert hansbog...@gmail.com
wrote:

 Exactly what I needed to know, one follow-up question though:

 An executor is terminated by Mesos if it has no running tasks

 Does this mean there is some timeout? Or does the “parent” framework
 actively have to give a command to shutdown the executor? Because using
 Spark in fine-grained mode for example, I don’t see the executors getting
 shutdown, even though they might not have tasks for a while. (I am glad
 they don’t get killed without consent of Spark, because we would lose our
 in memory data).

 On 30 Jun 2015, at 12:32, Alex Rukletsov a...@mesosphere.com wrote:

 There are two types of tasks: (1) those that specify an executor and (2)
 those that specify a command.

 When a task of type (1) arrives at a slave, the slave checks whether an
 executor with the same executorID already exists on this slave. If yes, the
 task is redirected to the executor; if not, then an executor instance is
 created. An executor is terminated by Mesos if it has no running tasks and
 all status updates for terminated tasks have been delivered.

 For tasks of type (2) a special executor (called MesosExecutor) is created
 for each task. When such task terminates or is killed, the corresponding
 executor shuts down as well.

 On Tue, Jun 30, 2015 at 12:08 PM, Hans van den Bogert 
 hansbog...@gmail.com wrote:

 I have difficulty understanding Mesos’ model.

 A framework can, for every accepted resource offer,  mention an executor
 besides the tasks descriptions it submits to Mesos. However does every use
 of offered resources, start a new executor? Thus for instance if the
 scenario occurs that two resource offers are used (shortly after each
 other),  which happen to be of the same slave, then are two executors
 started at one point? Or is the second batch of tasks given to the first
 started executor?

 I hope my question is clear, if not, let me know,

 Hans van den Bogert






Re: Setting minimum offer size

2015-06-30 Thread Alex Rukletsov
One option is to implement alternative behaviour in an allocator module.

On Tue, Jun 30, 2015 at 3:34 PM, Dharmesh Kakadia dhkaka...@gmail.com
wrote:

 Interesting.

 I agree, that dynamic reservation and optimistic offers will help mitigate
 the issue, but the resource fragmentation (and starvation due to that) is a
 more general problem. Predictive models can certainly aid the Mesos
 scheduler here. I think the filters in Mesos can be extended to add more
 general preferences like the offer size, execution/predictive model etc.
 For the Mesos scheduler, the user should be able to configure what all
 filters it recognizes while making offers, which will also make the effect
 on scalability limited, as far as I understand. Thoughts?

 Thanks,
 Dharmesh



 On Sun, Jun 28, 2015 at 7:29 PM, Alex Rukletsov a...@mesosphere.com
 wrote:

 Sharma,

 that's exactly what we plan to add to Mesos. Dynamic reservations will
 land in 0.23; the next step is to optimistically offer reserved but as yet
 unused resources (we call them optimistic offers) to other frameworks as
 revocable. The alternative with one framework will of course work, but this
 implies having a general-purpose framework that does some work that is
 better done by Mesos (which has more information and therefore can make
 better decisions).

 On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila spod...@netflix.com
 wrote:

 In a previous (more HPC like) system I worked on, the scheduler did
 advance reservation of resources, claiming bits and pieces it got and
 holding on until all were available. Say the last bit is expected to come
 in about 1 hour from now (and this needs job runtime estimation/knowledge),
 any short jobs are back filled on to the advance reserved resources that
 are sitting idle for an hour, to improve utilization. This was combined
 with weights and priority based job preemptions, sometimes 1GB jobs are
 higher priority than the 1GB job. Unfortunately, that technique doesn't
 lend itself natively onto Mesos based scheduling.

 One idea that may work in Mesos is (thinking aloud):

 - The large (20GB) framework reserves 20 GB on some number of slaves (I
 am referring to dynamic reservations here, which aren't available yet)
 - The small framework continues to use up 1GB offers.
 - When the large framework needs to run a job, it will have the 20 GB
 offers since it has the reservation.
 - When the large framework does not have any jobs running on it, the
 small framework may be given those resources, but, those jobs will have to
 be preempted in order to offer 20 GB to the large framework.

 I understand this idea has some forward looking expectations on how
 dynamic reservations would/could work. Caveat: I haven't involved myself
 closely with that feature definition, so could be wrong with my
 expectations.

 Until something like that lands, the existing static reservations, of
 course, should work. But, that reduces utilization drastically if the large
 framework runs jobs sporadically.

 Another idea is to have one framework schedule both the 20GB jobs and
 1GB jobs. Within the framework, it can bin pack the 1GB jobs on to as small
 a number of slaves as possible. This increases the likelihood of finding
 20GB on a slave. Combining that with preemptions from within the framework
 (a simple kill of certain number of 1GB jobs) should satisfy the 20 GB jobs.



 On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair tstcl...@redhat.com
 wrote:



 - Original Message -
  From: Brian Candler b.cand...@pobox.com
  To: user@mesos.apache.org
  Sent: Wednesday, June 24, 2015 10:50:43 AM
  Subject: Re: Setting minimum offer size
 
  On 24/06/2015 16:31, Alex Gaudio wrote:
   Does anyone have other ideas?
  HTCondor deals with this by having a defrag daemon, which periodically
  stops hosts accepting small jobs, so that it can coalesce small slots
  into larger ones.
 
 
 http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag
 

 Yuppers, and guess who helped work on it ;-)

  You can configure policies based on how many drained machines are
  already available, and how many can be draining at once.
 

 It had to be done this way, as there was only so much sophistication
 you can put into scheduling before you start to add latency.

  Maybe there would be a benefit if Mesos could work out what is the
  largest job any framework has waiting to run, so it knows whether
  draining is required and how far to drain down.  This might take the
  form of a message to the framework: suppose I offered you all the
  resources on the cluster, what is the largest single job you would want
  to run, and which machine(s) could it run on?  Or something like that.
 
  Regards,
 
  Brian.
 
 

 --
 Cheers,
 Timothy St. Clair
 Red Hat Inc.







Re: Setting minimum offer size

2015-06-28 Thread Alex Rukletsov
Sharma,

that's exactly what we plan to add to Mesos. Dynamic reservations will land
in 0.23; the next step is to optimistically offer reserved but as yet unused
resources (we call them optimistic offers) to other frameworks as revocable.
The alternative with one framework will of course work, but this implies
having a general-purpose framework that does some work that is better done
by Mesos (which has more information and therefore can make better
decisions).
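
For reference, a minimal sketch of the bin-packing idea from the
single-framework alternative quoted below: place each small job on the
first slave that still has room, so that the other slaves keep large
contiguous chunks free. first_fit and the sizes used are illustrative
assumptions, not part of Mesos or of any framework.

```python
def first_fit(job_sizes_gb, slave_capacity_gb, num_slaves):
    """Place each job on the first slave with enough free memory.

    Returns (free, placement): the free memory left per slave, and for each
    job the index of the slave it landed on (None if it did not fit).
    """
    free = [slave_capacity_gb] * num_slaves
    placement = []
    for size in job_sizes_gb:
        for index in range(num_slaves):
            if free[index] >= size:
                free[index] -= size
                placement.append(index)
                break
        else:
            placement.append(None)
    return free, placement


if __name__ == "__main__":
    # Fifteen 1GB jobs on three 24GB slaves.
    free, placement = first_fit([1] * 15, 24, 3)
    print(free)
```

Here all fifteen jobs land on the first slave, leaving two slaves with a
full 24GB each, enough for a 20GB job. Spreading the same jobs evenly (5
per slave) would leave 19GB everywhere and no slave able to host it.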

On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila spod...@netflix.com wrote:

 In a previous (more HPC like) system I worked on, the scheduler did
 advance reservation of resources, claiming bits and pieces it got and
 holding on until all were available. Say the last bit is expected to come
 in about 1 hour from now (and this needs job runtime estimation/knowledge),
 any short jobs are back filled on to the advance reserved resources that
 are sitting idle for an hour, to improve utilization. This was combined
 with weights and priority based job preemptions, sometimes 1GB jobs are
 higher priority than the 1GB job. Unfortunately, that technique doesn't
 lend itself natively onto Mesos based scheduling.

 One idea that may work in Mesos is (thinking aloud):

 - The large (20GB) framework reserves 20 GB on some number of slaves (I am
 referring to dynamic reservations here, which aren't available yet)
 - The small framework continues to use up 1GB offers.
 - When the large framework needs to run a job, it will have the 20 GB
 offers since it has the reservation.
 - When the large framework does not have any jobs running on it, the small
 framework may be given those resources, but, those jobs will have to be
 preempted in order to offer 20 GB to the large framework.

 I understand this idea has some forward looking expectations on how
 dynamic reservations would/could work. Caveat: I haven't involved myself
 closely with that feature definition, so could be wrong with my
 expectations.

 Until something like that lands, the existing static reservations, of
 course, should work. But, that reduces utilization drastically if the large
 framework runs jobs sporadically.

 Another idea is to have one framework schedule both the 20GB jobs and 1GB
 jobs. Within the framework, it can bin pack the 1GB jobs on to as small a
 number of slaves as possible. This increases the likelihood of finding 20GB
 on a slave. Combining that with preemptions from within the framework (a
 simple kill of certain number of 1GB jobs) should satisfy the 20 GB jobs.



 On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair tstcl...@redhat.com wrote:



 - Original Message -
  From: Brian Candler b.cand...@pobox.com
  To: user@mesos.apache.org
  Sent: Wednesday, June 24, 2015 10:50:43 AM
  Subject: Re: Setting minimum offer size
 
  On 24/06/2015 16:31, Alex Gaudio wrote:
   Does anyone have other ideas?
  HTCondor deals with this by having a defrag daemon, which periodically
  stops hosts accepting small jobs, so that it can coalesce small slots
  into larger ones.
 
 
 http://research.cs.wisc.edu/htcondor/manual/latest/3_5Policy_Configuration.html#sec:SMP-defrag
 

 Yuppers, and guess who helped work on it ;-)

  You can configure policies based on how many drained machines are
  already available, and how many can be draining at once.
 

 It had to be done this way, as there was only so much sophistication you
 can put into scheduling before you start to add latency.

  Maybe there would be a benefit if Mesos could work out what is the
  largest job any framework has waiting to run, so it knows whether
  draining is required and how far to drain down.  This might take the
  form of a message to the framework: suppose I offered you all the
  resources on the cluster, what is the largest single job you would want
  to run, and which machine(s) could it run on?  Or something like that.
 
  Regards,
 
  Brian.
 
 

 --
 Cheers,
 Timothy St. Clair
 Red Hat Inc.





Re: what reason caused the high cached memory by rcuos process

2015-06-22 Thread Alex Rukletsov
Tommy,

I'm not sure it's Mesos related; it looks like your kernel is configured in a
way that requires the RCU-related processes to run. If you kill the
mesos-slave process, are the rcuos* processes gone?

On Wed, Jun 3, 2015 at 12:31 PM, tommy xiao xia...@gmail.com wrote:

  Hi Alex,

 My concern is what causes so many rcuos processes to be spawned. I am only
 running a mesos-slave instance.

 2015-05-26 23:50 GMT+08:00 Alex Rukletsov a...@mesosphere.com:

 What exactly is your concern?

 On Mon, May 25, 2015 at 2:45 PM, tommy xiao xia...@gmail.com wrote:

 Today I set up a testing cluster in Azure Cloud. The slave node only runs
 a mesos-slave daemon. Sometimes I find the slave host in a weird state:
 the amount of cached memory keeps growing, and I don't know what causes
 it. Can anyone help? The attached PNG is a screenshot of the test
 environment.

 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com





 --
 Deshi Xiao
 Twitter: xds2000
 E-mail: xiaods(AT)gmail.com



Re: Resource modelling questions

2015-06-19 Thread Alex Rukletsov
Inlined.

On Fri, Jun 19, 2015 at 4:17 AM, zhou weitao zhouwtl...@gmail.com wrote:

 Alex, hi,

 2015-06-18 23:25 GMT+08:00 Alex Rukletsov a...@mesosphere.com:

 Zhou,

 I haven't read the *Design* yet, but I don't think priorities and quota
 address the same question. For example, assume we only have 10G of memory
 in total, reserved for framework A; then another urgent framework gets
 nothing, which is still static partitioning, while priority can pre-empt
 that.


 I'm not sure what your concern is here. If all you have is 10GB and both
 your frameworks A and B may need 10GB each at the same time, you definitely
 need to add more RAM : ). Mesos uses fair sharing for distributing
 resources among frameworks, and quota will not be an exception. If all you
 have is 10GB and for each of A and B you have reserved 10GB, total reserved
 resources are 20GB, which means your cluster is under quota. I would say,
 if this is happening in a production cluster, several devops engineers
 should have already been paged : ).

 It's up to the allocator implementation to decide what to do in this
 situation. An obvious approach is to throttle (i.e. revoke resources from)
 both frameworks proportionally to their role weights. Quota does not
 introduce static partitioning; rather, it guarantees that a production
 framework gets enough resources regardless of any events happening in the
 cluster, given these resources are available.


 First, I am sorry for my poor English. In a nutshell, my point is that
 priorities and Mesos quota are different things. I am confused about the
 following 2 questions:

No worries, your English is good enough to understand what you mean.


 1. Will the *quota* always be reserved for some framework in the cluster?
 And are all other frameworks forbidden to use it?

The quota is being designed right now and the draft has not been published
yet. Our current idea is to set quota per role and therefore leverage role
weights. Other frameworks will be able to use resources from unused quotas,
but will have to release the resources at the instant they are required by
the framework which owns them.


 2. I am picturing a scenario like the one Brian described above. I have 2
 frameworks: framework A is responsible for real-time jobs, with high
 priority but low frequency of use, while framework B is responsible for
 offline jobs, always running and greedy for resources, but low priority.
 If A and B use the same Mesos cluster, how can I configure it to let B use
 all of the resources until A preempts some of them?

An operator sets up quota for A, which is unused most of the time and B
will be offered resources marked as revocable from A's quota. If A needs to
launch more tasks, B tasks will be preempted.
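
To make the scenario concrete, here is a toy model in Python. It is only a
sketch of the idea described above: the quota design is not published yet,
so the class and method names are hypothetical and none of this is a Mesos
API. It assumes A's quota is no larger than the cluster.

```python
class ToyCluster:
    """Toy model: A holds a quota, B may borrow the unused part as revocable."""

    def __init__(self, total_gb, quota_a_gb):
        self.total_gb = total_gb
        self.quota_a_gb = quota_a_gb   # assumed to be <= total_gb
        self.used_a = 0                # GB used by A, never above its quota
        self.used_b = 0                # GB used by B, including revocable GB

    def free_gb(self):
        return self.total_gb - self.used_a - self.used_b

    def launch_b(self, gb):
        """B is greedy: it takes whatever is free, even inside A's unused quota."""
        granted = min(gb, self.free_gb())
        self.used_b += granted
        return granted

    def launch_a(self, gb):
        """A launches within its quota; B is preempted if free space is short."""
        gb = min(gb, self.quota_a_gb - self.used_a)
        shortfall = max(0, gb - self.free_gb())
        self.used_b -= shortfall       # preempt B tasks to cover the shortfall
        self.used_a += gb
        return shortfall               # GB revoked from B
```

With a 10GB cluster and a 4GB quota for A, B can first fill all 10GB; when A
then asks for 3GB, 3GB worth of B tasks are revoked.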




 thanks a lot.

 weitao Zhou




Re: Resource modelling questions

2015-06-18 Thread Alex Rukletsov
Zhou,

 I haven't read the *Design* yet, but I don't think priorities and quota
 address the same question. For example, assume we only have 10G of memory
 in total, reserved for framework A; then another urgent framework gets
 nothing, which is still static partitioning, while priority can pre-empt
 that.


I'm not sure what your concern is here. If all you have is 10GB and both
your frameworks A and B may need 10GB each at the same time, you definitely
need to add more RAM : ). Mesos uses fair sharing for distributing
resources among frameworks, and quota will not be an exception. If all you
have is 10GB and for each of A and B you have reserved 10GB, total reserved
resources are 20GB, which means your cluster is under quota. I would say,
if this is happening in a production cluster, several devops engineers
should have already been paged : ).

It's up to the allocator implementation to decide what to do in this
situation. An obvious approach is to throttle (i.e. revoke resources from)
both frameworks proportionally to their role weights. Quota does not
introduce static partitioning; rather, it guarantees that a production
framework gets enough resources regardless of any events happening in the
cluster, given these resources are available.
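
As an illustration of the "throttle proportionally to role weights" idea, a
minimal Python sketch follows. It is not the Mesos allocator; the function
name and the dictionary-based inputs are assumptions made for the example.
It splits the capacity by weight and never gives a framework more than it
asks for.

```python
def throttle(capacity, demands, weights):
    """Split capacity by weight, capping each framework at its demand."""
    alloc = {name: 0.0 for name in demands}
    active = {name for name, demand in demands.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[n] for n in active)
        # Frameworks whose remaining demand fits inside their weighted share.
        satisfied = {
            n for n in active
            if demands[n] - alloc[n] <= remaining * weights[n] / total_weight
        }
        if not satisfied:
            # Nobody can be fully served: hand out what is left by weight.
            for n in active:
                alloc[n] += remaining * weights[n] / total_weight
            break
        for n in satisfied:
            remaining -= demands[n] - alloc[n]
            alloc[n] = float(demands[n])
        active -= satisfied
    return alloc
```

For the case discussed here (10GB in total, A and B each wanting 10GB), equal
weights give each framework 5GB, while weights of 3 and 1 give 7.5GB and
2.5GB.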


Re: mesosphere.io broken?

2015-06-17 Thread Alex Rukletsov
For downloads, use https://mesosphere.com/downloads/
Elastic Mesos has been decommissioned, use https://google.mesosphere.com/
or https://digitalocean.mesosphere.com/ but keep in mind they will be
decommissioned soon (~1 month) as well. However, if you want to try DCOS
installation on AWS, check https://mesosphere.com/product/

On Wed, Jun 17, 2015 at 12:51 PM, Brian Candler b.cand...@pobox.com wrote:

 Looking for Mesos .deb packages, on Google I find links to
 http://mesosphere.io/downloads/
 http://elastic.mesosphere.io/
 but these are giving 503 Service Unavailable errors.

 Is there a problem, or have these sites gone / migrated away?




Re: Setting Rate of Resource Offers

2015-06-14 Thread Alex Rukletsov
Christopher,

Try adjusting the master's allocation_interval flag. It specifies how often
the allocator performs batch allocations to frameworks. As Ondrej pointed
out, if your framework explicitly declines offers, it won't be re-offered
the same resources for some period of time.
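
How the two settings interact can be sketched with a simplified model in
Python. This is not the allocator's actual logic: it assumes batch
allocations happen at fixed ticks starting at time 0 and that a declined
resource becomes eligible again once the refuse period has passed. The
function name is made up for the example.

```python
import math

def earliest_reoffer(decline_time, refuse_seconds, allocation_interval):
    """First batch-allocation tick at which declined resources may come back.

    Assumes ticks at 0, allocation_interval, 2 * allocation_interval, ...
    """
    filter_expires = decline_time + refuse_seconds
    ticks = math.ceil(filter_expires / allocation_interval)
    return ticks * allocation_interval
```

In this model, lowering the allocation interval only helps up to a point:
if offers are declined with a long refuse period, that period dominates the
wait.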

On Sat, Jun 13, 2015 at 8:30 PM, Ondrej Smola ondrej.sm...@gmail.com
wrote:

 Hi Christopher,

 I don't know of any way to speed up the first resource offer;
 in my experience new offers arrive almost immediately after framework
 registration. It depends on the infrastructure you are testing your
 framework on: are there any other frameworks running? As discussed in
 another thread, offers should be sent to multiple frameworks at once.
 There may be a small delay caused by initial registration and network
 latency. If you are talking about re-offers (re-offering declined
 offers), there should be a parameter to set the re-offer interval. For
 example, in Go you can decline an offer this way (it is also important to
 decline every unused offer):

 driver.DeclineOffer(offer.Id, mesos.Filters{RefuseSeconds:
 proto.Float64(5)})

 Look at the Mesos UI; it should give you information about which offers
 are offered to which frameworks. The Mesos master logs also give you this
 information.


 2015-06-13 18:23 GMT+02:00 Christopher Ketchum cketc...@ucsc.edu:
  Hi,
 
  I was wondering if there was any way to adjust the rate of resource
 offers to the framework. I am writing a Mesos framework, and when I am
 testing it I am noticing a slight pause where the framework seems to be
 waiting for another resource offer. I would like to know if there is any
 way to speed these offers up, just to make testing a little faster.
 
  Thanks,
  Chris



Re: Can Mesos master offer resources to multiple frameworks simultaneously?

2015-06-10 Thread Alex Rukletsov
I'll try to answer these questions.

1. Currently, the only language you can use is C++. You can work around this
by writing a proxy in C++ that delegates the calls to, say, Python scripts.
See http://mesos.apache.org/documentation/latest/allocation-module/ for
more details.

2. The default allocator is called dominant resource fairness since it
tries to distribute resources fairly between active frameworks. This means
it will offer all available resources to all frameworks, but each framework
will get only a certain share. For more information I encourage you to take
a look at the DRF paper.

3. Offered and not declined resources are considered to be used, therefore
they can't be re-offered until freed.
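
To illustrate point 2, here is a minimal Python sketch of the core DRF idea
from the paper: a framework's dominant share is the largest fraction it
holds of any single resource, and the framework with the smallest dominant
share is served next. The function names and the dictionary layout are
assumptions for the example, not the allocator's code.

```python
def dominant_share(allocated, total):
    """Largest fraction of any single resource that a framework holds."""
    return max(allocated.get(r, 0) / total[r] for r in total)

def next_framework(allocations, total):
    """Pick the framework with the smallest dominant share."""
    return min(allocations, key=lambda f: dominant_share(allocations[f], total))
```

For a cluster with 9 CPUs and 18GB of memory, a framework holding 1 CPU and
4GB has a dominant share of 4/18 (memory), while one holding 3 CPUs and 1GB
has a dominant share of 3/9 (CPU), so the first one is served next.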

Hope this helps.
On 10 Jun 2015 7:53 am, Qian Zhang zhq527...@gmail.com wrote:

 Thanks Adam, this is very helpful!

 I have a few more questions:
 1. For the pluggable allocator modules, can I write my own allocator in
 any programming language (e.g., Python, Go, etc)?
 2. For the default DRF allocator, when it offers resources to a framework,
 will it offer all the available resources (resources not being used by any
 frameworks) to it? Or just part of the available resources?
 3. If there are multiple frameworks and the default DRF allocator will
 only offer resources to a single framework at a time, then that means
 framework 2 has to wait for framework 1 until framework 1 makes its
 placement decision?





Re: 答复: [DISCUSS] Renaming Mesos Slave

2015-06-08 Thread Alex Rukletsov
While I'm apathetic to changing the name, I think we should do more than
just voting on an alternate name in case we decide to proceed and replace
the master/slave terminology. Such a change is very expensive, and it makes
more sense to do it once than to rush and pick an ambiguous term. If we
take this step, we can use it as an opportunity to choose a *better* name
for key Mesos components.

My suggestion is to add pros and cons to every name put in for voting.
Let's back up each proposal with meaningful explanation why this proposal
should be preferred over others. I'll give an example (I will stick to the
current terminology for clarity):
* -1 for 'worker' as it implies the slave process does the actual work,
which is not true and misleading.
* -1 for 'leader/follower' as mesos slaves do not really *follow* the mesos
master; can be confused with leading/shadowing master(s).
* +1 for disambiguating between mesos slave process and mesos slave node:
fwiw, multiple slave processes can be running on the same node.

Some time ago we had an offline discussion about whether master and slave
should actually be different entities. Having a single entity, say,
mesos-agent, that can act either as slave or as master can be beneficial.
Though this is outside of the scope of the current thread, I would like to
keep it in mind and be as general as possible while choosing the name.

Hence, my favourites so far are:
1. Mesos Node [can be disambiguated as Mesos Master Node or Mesos Agent
Node]
2. Mesos Agent
3. No [Mesos Master can mean a particular mode in which a Mesos Agent
currently operates]
4. Start using it in presentations, JIRAs, mailing lists, then proceed to
docs update; change code via deprecation process once new terminology is
settled.


On Mon, Jun 8, 2015 at 10:12 AM, Aaron Carey aca...@ilm.com wrote:

  I've been following this thread with interest, it draws a lot of
 parallels with similar problems my wife faces as a teacher (and I imagine
 this happens in other government/public sector organisations, earlier in
 this thread James pointed me to an interesting Wikipedia article which
 suggested this also happens occasionally in software: eg County of Los
 Angeles in 2003). Every few years teachers are told to change the words
 used to describe various things related to kids with minority backgrounds,
 from underprivileged families or with disabilities and so on, usually to
 stop other children from using them as derogatory terms or insults. It
 works for a while and then the pupils catch on and start using the new
 words and the cycle repeats.

 I guess the point I'm trying to make here is that if you do decide to
 change the naming of master/slave because some naughty programmers in the
 community have been using the terms offensively, you better make damn sure
 you choose new terms which aren't likely to cause offence in the future and
 require the whole renaming process to run again. Which is why I'm voting
 for:

 +1 Gru/Minion

 There could also be another option: These terms are all being used to
 describe a master/slave relationship, the mesos master is in charge, it
 assigns work to the slaves and ensures that they carry it out. I'd suggest
 that whatever you call this pair, the relationship will always be one of
 domination and servitude. Perhaps what is really needed here is to get rid
 of the concept of a master altogether and re-architect mesos so all nodes
 in the cluster are equal and reach a consensus together about work
 distribution and so on?


  --
 *From:* Nikolay Borodachev [nbo...@adobe.com]
 *Sent:* 06 June 2015 04:34
 *To:* user@mesos.apache.org
 *Subject:* RE: 答复: [DISCUSS] Renaming Mesos Slave

   +1 master/slave – no need to change



 *From:* Sam Salisbury [mailto:samsalisb...@gmail.com]
 *Sent:* Friday, June 05, 2015 8:31 AM
 *To:* user@mesos.apache.org
 *Subject:* Re: 答复: [DISCUSS] Renaming Mesos Slave



 Master/Minion +1



 On 5 June 2015 at 15:14, CCAAT cc...@tampabay.rr.com wrote:


 +1 master/slave, no change needed.  is the same as
 master/slave, i.e. keep the nomenclature as it currently is

 This means keep the name 'master' and keep the name 'slave'.


 Are you applying fuzzy math or kalman filters to your summations below?

 It looks to me, tallying things up, Master is kept as it is
 and 'Slave' is kept as it is. There did not seem to be any consensus
 on the new names if the pair names are updated. Or you can vote separately
 on each name? On a real ballot, you enter the choices,
 vote according to your needs, tally the results and publish them.
 Applying a 'fuzzy filter' to what has occurred in this debate so far
 is ridiculous.

 Why not repost the question like this or something on a more fair
 voting preference:

 
 Please vote for your favourite Name-pair in Mesos, for what is currently
 Master-Slave. Note Master-Slave is the no change vote option.

 [] Master-Slave
 [] Mesos-Slave
 [] Mesos-Minion
 [] Master-Minion

Re: Restarting mesos-slave in a node restarts all the apps

2015-05-29 Thread Alex Rukletsov
Siva,

yes, this is intended behaviour: keep tasks running and give the Mesos
Worker some time to re-register. You can adjust this timeout via
--slave_reregister_timeout, but keep in mind 10 min is the minimum.

On Fri, May 29, 2015 at 8:04 AM, Sivaram Kannan sivara...@gmail.com wrote:


 Hi ,

 If I restart the mesos-slave service on a node, does it bring down the apps
 running on that node? The apps come up again after the mesos-slave
 service is up, but I wanted to confirm whether that is the expected
 behaviour.

 Thanks,
 ./Siva.



Re: Reminder: /stats.json is deprecated

2015-05-20 Thread Alex Rukletsov
Reminder: please don't forget to update your code to use the
/metrics/snapshot endpoint instead of the deprecated /stats.json prior to
the Mesos 0.23 release.
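
For tooling written in Python, the switch can look roughly like the sketch
below. It assumes the endpoint is served on the usual master port 5050 and
returns a flat JSON object mapping metric names to numbers; the helper names
are made up for the example.

```python
import json
from urllib.request import urlopen

def snapshot_url(host, port=5050):
    """Build the URL of the /metrics/snapshot endpoint (port is an assumption)."""
    return "http://{}:{}/metrics/snapshot".format(host, port)

def parse_snapshot(body):
    """Decode the response body: a flat JSON object of metric name to value."""
    return json.loads(body)

def fetch_snapshot(host, port=5050, timeout=5):
    """Fetch and decode the current metrics from a running Mesos process."""
    with urlopen(snapshot_url(host, port), timeout=timeout) as response:
        return parse_snapshot(response.read().decode("utf-8"))
```

Code that used to read a field from /stats.json would call
fetch_snapshot() with the master's host name and look up the corresponding
metric key in the returned dictionary.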

On Wed, Apr 8, 2015 at 1:07 PM, Alex Rukletsov a...@mesosphere.com wrote:

 Folks,

 if you build tooling around Mesos, please be advised that in the current
 Mesos 0.22 release the /stats.json endpoint is deprecated in favour of
 /metrics/snapshot and will be removed in the Mesos 0.23 release. If you
 rely on /stats.json, please update your code to use the new endpoint
 instead.

 Related JIRA: https://issues.apache.org/jira/browse/MESOS-2058
 Commit: d9ba9199a8c8357ab13a1b14f8ee63409c5ac310




Re: Changing Mesos Minimum Compiler Version

2015-04-21 Thread Alex Rukletsov
Folks, let's summarize and move on here.

Proposal out on April 9, 2015. Current status (as of April 21, 2015):


+1 (Binding)
--
Vinod Kone
Timothy Chen
Yan Xu
Brenden Matthews

+1 (Non-binding)
--
Cody Maloney
Joris Van Remoortere
Jeff Schroeder
Jörg Schad
Elizabeth Lingg
Alexander Rojas
Alex Rukletsov
Michael Park
Haosdent Huang
Bernd Mathiske

0 (Non-binding)
--
Nikolaos Ballas

There were no -1 votes.

Cody, let's convert MESOS-2604 to an epic and bump the version in 0.23.

Thanks,
Alex


On Mon, Apr 13, 2015 at 12:46 PM, Bernd Mathiske be...@mesosphere.io
wrote:

 +1

  On Apr 10, 2015, at 6:02 PM, Michael Park mcyp...@gmail.com wrote:
 
  +1
 
  On 9 April 2015 at 17:33, Alexander Gallego agall...@concord.io wrote:
 
  This is amazing for native devs/frameworks.
 
  Sent from my iPhone
 
  On Apr 9, 2015, at 5:16 PM, Joris Van Remoortere jo...@mesosphere.io
  wrote:
 
  +1
 
  On Thu, Apr 9, 2015 at 2:14 PM, Cody Maloney c...@mesosphere.io
  wrote:
  As discussed in the last community meeting, we'd like to bump the
  minimum required compiler version from GCC 4.4 to GCC 4.8.
 
  The overall goals are to make Mesos development safer, faster, and
  reduce the maintenance burden. Currently a lot of stout has different
  codepaths for Pre-C++11 and Post-C++11 compilers.
 
  Progress will be tracked in the JIRA: MESOS-2604
 
  The resulting supported compiler versions will be:
  GCC 4.8, GCC 4.9
  Clang 3.5, Clang 3.6
 
  For reference
  Compilers by Distribution Version: http://goo.gl/p1t1ls
 
  C++11 features supported by each compiler:
  https://gcc.gnu.org/projects/cxx0x.html
  http://clang.llvm.org/cxx_status.html
 
 



