Re: [VOTE] Move Apache Mesos to Attic

2021-04-08 Thread Benjamin Bannier
Hi Shane,

>> FWIW, I'm one of those people who said they were interested, and I
>> still voted to move it to the attic (even though my vote is non
>> binding as I'm not a committer).
>
>That's great!  One of the questions I have for the project is: why
>haven't they made you a committer yet?  (That's a question for the PMC,
>not for you, really, but it's one I'm betting the Board would be curious
>to hear about).

I fear the problem is less that contributors haven't been voted in as
committers, but that the project has unfortunately seen hardly any bigger
patches contributed from the community in the last couple of years (I know
we merged patches from Charles, but when some people voiced interest here
that was the first I had heard of them). This makes me pessimistic about a
handover that could improve the situation in the medium term.

I summarized some issues I saw in an earlier thread (yes, this discussion
has been going on since February 2021), and I think those issues were only
marginally related to committership, if at all:
https://lists.apache.org/thread.html/r4bccbf048a9bcde3f0bb66d5e2c57f585296e1f5e2769486413b2758%40%3Cdev.mesos.apache.org%3E


Cheers,

Benjamin


Re: [VOTE] Move Apache Mesos to Attic

2021-04-05 Thread Benjamin Bannier
With a heavy heart, but also curiosity about what will come next, +1.


Benjamin

On Mon, Apr 5, 2021 at 7:58 PM Vinod Kone  wrote:

> Hi folks,
>
> Based on the recent conversations
> 
> on our mailing list, it seems to me that the majority consensus among the
> existing PMC is to move the project to the attic
>  and let the interested community members
> collaborate on a fork in Github.
>
> I would like to call a vote to dissolve the PMC and move the project to
> the attic.
>
> Please reply to this thread with your vote. Only binding votes from
> PMC/committers count towards the final tally but everyone in the community
> is encouraged to vote. See process here
> .
>
> Thanks,
>


Re: Feature requests for Mesos

2021-03-01 Thread Benjamin Bannier
Hi Charles-François,

thanks for your detailed message; you captured important points, and I
think I agree with your sentiment here. Mesos might still have a place, and
before thinking about what new features to add, the project first needs to
solve more fundamental issues.

My previous pessimistic assessment on this list came from a similar angle
but I think with wider scope: a healthy project requires a healthy
community where users can find help, but also can have some hope that
important issues will get fixed. I have not been able to spend much time on
Mesos in the last year, but was following Slack and the mailing lists (the
ones with humans and the ones with bots). On the mailing lists I see users
ask for help with issues they run into or ask questions, but they only
rarely get a response from committers or other community members. Few new
JIRA issues have been filed since fall 2020, and hardly any of them have
been triaged, let alone fixed (this is on top of the existing bug backlog).
I do not think one needs to be a committer to improve on that situation if
one can get help getting patches discussed, reviewed and ultimately merged. It
looks like Andrei and Qian have committed to help on the latter, but I have
only rarely seen community members volunteer for the former.

When I wrote that I thought starting a new project on top of Apache Mesos
today might not be a good idea, I mainly came from that angle. While the
software does work for many use cases, it seems to be unmaintained, with
hardly anyone active in taking it further globally, beyond their own
immediate needs, and willing to take on the needed work. Being a top-level
Apache project with a strong history, Apache Mesos still has a brand, but I
don't think it has lived up to the associated expectations. Similarly, big
ownership gaps (technical and project-wise) have developed which neither
active committers nor community members have filled. Again, one would not
need to be a committer to develop expertise and contribute, and actually
the natural and historic process was for folks to do exactly that with
committership being a thing only after getting involved (see
https://community.apache.org/newcommitter.html for Apache's high-level view
on that). This is the issue of continued trust Renan mentioned in their
message to the user mailing list which I also believe is critical so the
project can live up to its promise (this is integral to being an Apache
project, see e.g., https://www.apache.org/theapacheway).

As a non-user with emotional attachment to the historic Apache Mesos brand,
my list of areas in need of improvement to resurrect this project would be:

- willingness of the remaining active committers to engage with the community
on a regular basis, both on the user and the contributor side (in PRs, review
requests, on mailing lists),
- transparent and active discussions in the community (among committers and
contributors, and among committers themselves), in applicable form, beyond
roll calls,
- a timely and consistent process to address user issues, and
- consistent ownership of the bug and feature backlog.

Note that work on new feature requests is absent from my list. That folks
want to discuss that here and now seems to me another sign that the Mesos
community is not in a good place, given all its existing non-technical
issues.


Best,

Benjamin


Re: Next Steps

2021-02-18 Thread Benjamin Bannier
Hi Vinod,

> I would like to start a discussion around the future of the Mesos project.
>
> As you are probably aware, the number of active committers and
contributors
> to the project have declined significantly over time. As of today, there's
> no active development of any features or a public release planned. On the
> flip side, I do know there are a few companies who are still actively
using
> Mesos.

Thanks for starting this discussion, Vinod. Looking at Slack, the mailing
lists, JIRA and ReviewBoard/GitHub, the project has wound down a lot in
the last 12+ months.

> Given that, we need to assess if there's interest in the community to keep
> this project moving forward. Specifically, we need some active committers
> and PMC members who are going to manage the project. Ideally, these would
> be people who are using Mesos in some capacity and can make code
> contributions.

While I have seen a few non-committer folks contribute patches in recent
months, I feel it might be too late to bootstrap an active community at
this point.

Apache Mesos is still mentioned prominently in the docs of a number of
other projects, which gives the impression of an active and maintained
project. In reality almost nobody is working on issues or is available to
help users, and basing a new project on Apache Mesos these days is probably
not a good idea. I honestly do not see that changing even should new people
step up, and IMO the most honest way forward would be
to move the project to the attic to clearly communicate that the project
has moved into another phase; this wouldn't preclude folks from using or
further developing Apache Mesos, but would give a clear signal to users.

> If there is no active interest, we will likely need to figure out steps
for
> retiring the project.
>
> *Call for action: If you are interested in becoming a committer/PMC member
> (including PMC chair) and actively maintain the project, please reply to
> this email.*

As I wrote above, I would be in favor of a vote to move Apache Mesos
to the attic.


Cheers,

Benjamin


RFC: Extending supported RESERVE operations

2019-09-30 Thread Benjamin Bannier
Hi,

Mesos currently puts a number of restrictions on what a RESERVE operation can
do (e.g., only one refinement may be added; there is no support for changing a
resource's reservation), which implies restrictions elsewhere, e.g., on
persistent volumes. In order to make reservations more flexible we came up
with a design to support "re-reserving" (modifying a resource's reservation
role), which also seems to enable a number of other use cases.

The current design doc is 
https://docs.google.com/document/d/1LFh0OkOEHslmK6xqok1fCn2MOqGefvNodusOOnV66Q4/.


Cheers,

Benjamin

Re: RFC: Improving linting in Mesos (MESOS-9630)

2019-09-18 Thread Benjamin Bannier
Hello again,

I have landed the patches for MESOS-9630 on the `master` branch, so we now
use pre-commit as our linting framework.

pre-commit primer
=================

0. Install pre-commit, https://pre-commit.com/#install.

1. Run `./support/setup-dev.sh` to install hooks. We have broken
developer-related setup out of `./bootstrap` which by now only bootstraps
the autotools project while `support/setup-dev.sh` sets up developer
configuration files and git hooks.

2. As git hooks are global to a checkout and not tied to branches, you
might run into issues with the linter setup on older branches since
configuration files or scripts might not be present. You should either set
up that branch's linters with e.g., `./bootstrap`, or silence warnings from
the missing linter setup with e.g.,

   $ PRE_COMMIT_ALLOW_NO_CONFIG=1 git commit

3. You can use the `SKIP` environment variable to disable certain linters,
e.g.,

   # git-revert(1) often produces titles longer than 72 characters.
   $ SKIP=gitlint git revert HEAD

   `SKIP` takes a linter `id` which you can look up in
`.pre-commit-config.yaml`.

4. We still use git hooks, but to explicitly lint your staged changes
before a commit, execute

   # Run all applicable linters,
   $ pre-commit

   # or a certain linter, e.g., `cpplint`.
   $ pre-commit run cpplint

   pre-commit runs only on staged changes.

5. To run a full linting of the whole codebase execute

   $ SKIP=split pre-commit run -a

   We need to skip the `split` linter as it would complain about a mix of
files from stout, libprocess, and Mesos proper (it could be rewritten to
lift this preexisting restriction).

6. pre-commit caches linter environments in `$XDG_CACHE_HOME/pre-commit`
where `XDG_CACHE_HOME` is most often `$HOME/.cache`. While pre-commit
automatically sets up linter environments, cleanup is manual

   # gc unused linter environments, e.g., after linter updates.
   $ pre-commit gc

   # Remove all cached environments.
   $ pre-commit clean

7. To make changes to your local linting setup replace the symlink
`.pre-commit-config.yaml` with a copy of `support/pre-commit-config.yaml`
and adjust as needed. pre-commit maintains a listing of hooks of varying
quality (https://pre-commit.com/hooks.html), and other linters can be added
pretty easily (see e.g., the `local` linters `split`, `license`, and
`cpplint` in our setup). Consider upstreaming whatever you found useful.



Happy linting,

Benjamin

On Sat, Aug 17, 2019 at 2:12 PM Benjamin Bannier 
wrote:

> Hi,
>
> I opened MESOS-9630[^1] to improve the way we do linting in Mesos some time
> ago. I have put some polish on my private setup and now published it, and
> am
> asking for feedback as linting is an important part of working with Mesos
> for
> most of you. I have moved my workflow to pre-commit more than 6 months ago
> and
> prefer it so much that I will not go back to `support/mesos-style.py`.
>
> * * *
>
> We use `support/mesos-style.py` to perform linting, most often triggered
> automatically when committing. This setup is powerful, but also hard to
> maintain and extend. pre-commit[^2] is a framework for managing Git commit
> hooks which has an exciting set of features, one can often enough
> configure it
> only with YAML and comes with a long list of existing linters[^3]. Should
> we
> go with this approach we could e.g., trivially enable linters for Markdown
> or
> HTML (after fixing the current, sometimes wild state of the sources).
>
> I would encourage you to play with the review chain ending in r/71300[^4] on
> some
> fresh clone (as this modifies your Git hooks). You need to install
> pre-commit[^5] _before applying the chain_, and then run
> `support/setup_dev.sh`. This setup mirrors the existing functionality of
> `support/mesos-style.py`, but also has new linters activated. This should
> present a pretty streamlined workflow. I have also adjusted the Windows
> setup,
> but not tested it.
>
> I have also spent some time to make transitioning from our current linting
> setup easier. If you are feeling adventurous you can apply the chain up to
> r/71209/ on your existing setup and run `support/setup_dev.sh`.
>
> One noticeable change is that with pre-commit we will store (some) linters
> in
> `$XDG_CACHE_HOME` (default: `$HOME/.cache`). The existing setup stores some
> linter files in the build directory, so a "clean build" might require
> downloading linter files again. With pre-commit OTOH one needs to perform
> garbage-collection out of band (e.g., by executing `pre-commit gc`, or
> deleting
> the cache directory).
>
> * * *
>
> Please let me know whether we should move forward with this change, you
> think
> it needs important adjustments, or you see fundamental reasons that this
> is a
> bad idea. If you like what you see here I would be happy to know about that
> as well.

RFC: Improving linting in Mesos (MESOS-9630)

2019-08-17 Thread Benjamin Bannier
Hi,

I opened MESOS-9630[^1] to improve the way we do linting in Mesos some time
ago. I have put some polish on my private setup and have now published it, and
am asking for feedback as linting is an important part of working with Mesos
for most of you. I moved my workflow to pre-commit more than 6 months ago and
prefer it so much that I will not go back to `support/mesos-style.py`.

* * *

We use `support/mesos-style.py` to perform linting, most often triggered
automatically when committing. This setup is powerful, but also hard to
maintain and extend. pre-commit[^2] is a framework for managing Git commit
hooks which has an exciting set of features; one can often enough configure it
with YAML alone, and it comes with a long list of existing linters[^3]. Should
we go with this approach we could e.g., trivially enable linters for Markdown
or HTML (after fixing the current, sometimes wild state of the sources).

I would encourage you to play with the review chain ending in r/71300[^4] on
some fresh clone (as this modifies your Git hooks). You need to install
pre-commit[^5] _before applying the chain_, and then run
`support/setup_dev.sh`. This setup mirrors the existing functionality of
`support/mesos-style.py`, but also has new linters activated. This should
present a pretty streamlined workflow. I have also adjusted the Windows setup,
but not tested it.

I have also spent some time to make transitioning from our current linting
setup easier. If you are feeling adventurous you can apply the chain up to
r/71209[^6] on your existing setup and run `support/setup_dev.sh`.

One noticeable change is that with pre-commit we will store (some) linters in
`$XDG_CACHE_HOME` (default: `$HOME/.cache`). The existing setup stores some
linter files in the build directory, so a "clean build" might require
downloading linter files again. With pre-commit OTOH one needs to perform
garbage collection out of band (e.g., by executing `pre-commit gc`, or by
deleting the cache directory).

* * *

Please let me know whether we should move forward with this change, whether
you think it needs important adjustments, or whether you see fundamental
reasons that this is a bad idea. If you like what you see here I would be
happy to know about that as well.


Cheers,

Benjamin


[^1]: https://issues.apache.org/jira/browse/MESOS-9630
[^2]: https://pre-commit.com/
[^3]: https://pre-commit.com/hooks.html
[^4]: https://reviews.apache.org/r/71300/
[^5]: https://pre-commit.com/#install
[^6]: https://reviews.apache.org/r/71209


Re: Why does not mesos provide linux packages ?

2019-04-03 Thread Benjamin Bannier
Hi,

> why don't we have packages for the main linux distributions? like ubuntu
> and redhat?

Just reiterating my comment from
https://issues.apache.org/jira/browse/MESOS-6851?focusedCommentId=16808547&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16808547
here for posterity.

The Mesos community provides RPM packages; Mesosphere also provides Debian
packages, and any help moving these into Apache Mesos is likely welcome.


Cheers,

Benjamin

Re: Discussion: Scheduler API for Operation Reconciliation

2019-01-16 Thread Benjamin Bannier
Hi,

have we reached a conclusion here?

From the Mesos side of things I would be strongly in favor of proposal (III).
This is not only consistent with what we do with task status updates, but also
would allow us to provide improved operation statuses (e.g.,
`OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN`) to better
distinguish non-terminal from terminal operation states. To accomplish that we
wouldn't need to introduce extra information leakage (e.g., explicitly keeping
the master up to date on local resource provider state, with the associated
internal consistency complications).

This approach should also simplify framework development as a framework would 
only need to watch a single channel to see operation status updates (no need to 
reconcile different information sources). The benefits of better status updates 
and simpler implementation IMO outweigh any benefits of the current approach 
(disclaimer: I filed the slightly inflammatory MESOS-9448).

What is keeping us from moving forward with (III) at this point?


Cheers,

Benjamin

> On Jan 3, 2019, at 11:30 PM, Benno Evers  wrote:
> 
> Hi Chun-Hung,
> 
> > imagine that there are 1k nodes and 10 active + 10 gone LRPs per node, then 
> > the master need to maintain 20k entries for LRPs.
> 
> How big would the required additional storage be in this scenario? Even if 
> it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad for 
> such a big cluster.
> 
> In general, it seems hard to discuss the trade-offs between your proposals 
> without looking at the users of that API - do you know if there are any 
> frameworks out there that already use operation reconciliation, and if so 
> what do they do based on the reconciliation response?
> 
> As far as I know, we don't have any formal guarantees on which operations 
> status changes the framework will receive without reconciliation. So putting 
> on my framework-implementer hat it seems like I'd have no choice but to 
> implement a continuously polling background loop anyway if I care about 
> knowing the latest operation statuses. If this is indeed the case, having a 
> synchronous `RECONCILE_OPERATIONS` would seem to have little additional 
> benefit.
> 
> Best regards,
> Benno
> 
> On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao  wrote:
> Hi folks,
> 
> Recently I've been discussing the problems of the current design of the
> experimental `RECONCILE_OPERATIONS` scheduler API with a couple of people.
> The discussion was started from MESOS-9318: when a framework receives an
> `OPERATION_UNKNOWN`, it doesn't know if it should retry the operation or
> not (further details described below).
> As the discussion
> evolves, we realize there are more issues to consider, design-wise and
> implementation-wise, so
> I'd like to reach out to the community to get valuable opinions from you
> guys.
> 
> Before I jump right into the issues I'd like to discuss, let me fill you
> guys in with some
> background of operation reconciliation. Since the design of this feature
> was informed by the
> pre-existing implementation of task reconciliation, I'll begin there.
> 
> *Task Reconciliation: Design*
> 
> The scheduler API has a `RECONCILE` call for a framework to query the
> current statuses
> of its tasks. This call supports the following modes:
> 
>- *Explicit reconciliation*: The framework specifies the list of tasks
>it wants to know
>about, and expects status updates for these tasks.
> 
>- *Implicit reconciliation*: The framework does not specify a list of
>tasks, and simply
>expects status updates for all tasks the master knows about.
> 
> In both cases, the master looks into its in-memory task bookkeeping and sends
> *one or more `UPDATE` events* to respond to the reconciliation request.
> 
> *Task Reconciliation: Problems*
> 
> This API design of task reconciliation has the following shortcomings:
> 
>- (1) There is no clear boundary of when the "reconciliation response"
>ends, and thus there is *no 1-1 correspondence between the reconciliation
>request and the response*. For explicit reconciliation, the framework might
>wait for an extended period of time before it receives all status updates;
>for implicit reconciliation, there is no way for a framework to tell if it
>has learned about all of its tasks, which could be inconvenient if the
>framework has lost its task bookkeeping.
> 
>- (2) The "reconciliation response" may be outdated. If an agent
>reregisters after a task
>reconciliation has been responded,
> *the framework wouldn't learn about the tasks **from this recovered agent*.
>Mesos relies on the framework to call the `RECONCILE` call
>*periodically* to get up-to-date task statuses.
> 
> 
> 
> *Operation Reconciliation: Design & Problems*
> 
> When designing operation reconciliation, we made the `RECONCILE_OPERATIONS`
> call
> *asynchronous 

Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-10 Thread Benjamin Bannier
Hi Ben et al.,

I'd expect frameworks to *always* know how to accept or decline offers in 
general. More involved frameworks might know how to suppress offers. I don't 
expect any framework to model filters and their associated durations in detail 
(that's why I called them a Mesos implementation detail), since doing so adds 
little to a framework's primary goal of running tasks as quickly as possible.

> I couldn't quite tell how you were imagining this would work, but let me 
> spell out the two models that I've been considering, and you can tell me if 
> one of these matches what you had in mind or if you had a different model in 
> mind:

> (1) "Effective limit" or "give me this much more" ...

This sounds more like an operator-type than a framework-type API to me. I'd 
assume that frameworks would not worry about their total limit the way an 
operator would, but instead care about getting resources to run a certain task 
at a point in time. I could also imagine this being easy to use incorrectly, 
as frameworks would likely need to understand their total limit when issuing 
the call, which could require state or coordination among internal framework 
components (think: multi-purpose frameworks like Marathon or Aurora).

> (2) "Matchers" or "give me things that look like this": when a scheduler 
> expresses its "request" for a role, it would act as a "matcher" (opposite of 
> filter). When mesos is allocating resources, it only proceeds if 
> (requests.matches(resources) && !filters.filtered(resources)). The open ended 
> aspect here is what a matcher would consist of. Consider a case where a 
> matcher is a resource quantity and multiple are allowed; if any matcher 
> matches, the result is a match. This would be equivalent to letting 
> frameworks specify their own --min_allocatable_resources for a role (which is 
> something that has been considered). The "matchers" could be more 
> sophisticated: full resource objects just like filters (but global), full 
> resource objects but with quantities for non-scalar resources like ports, etc.

I was thinking in this direction, but what you described is more involved than 
what I had in mind as a possible first attempt. I'd expect that frameworks 
currently use `REVIVE` as a proxy for `REQUEST_RESOURCES`, not as a way to 
manage their filter state tracked in the allocator. Assuming we have some way 
to express resource quantities (i.e., MESOS-9314), we should be able to improve 
on `REVIVE` by providing a `REQUEST_RESOURCES` which clears all filters for 
resources containing the requested resources (or all filters if there is no 
explicit resource request). Even if that led to more offers than needed, it 
would likely still perform better than `REVIVE` (or `CLEAR_FILTERS`, which has 
similar semantics). If we keep the scope of these calls narrow and clear we 
have freedom to be smarter internally in the future.

This should not only be pretty straightforward to implement in Mesos, but I'd 
imagine it would also map pretty well onto framework use cases (i.e., I assume 
frameworks are interested in controlling the resources they are offered, not 
in managing filters we maintain for them).

> With regard to incentives, the incentive today for adhering to suppress is 
> that your framework will be doing less processing of offers when it has no 
> work to do and that other instances of your own framework as well as other 
> frameworks would get resources faster. The second aspect is indeed indirect. 
> The incentive structure with "request" / "demand" does indeed seem to be more 
> direct (while still having the indirect benefit on other frameworks / roles): 
> "I'll tell you what to show me so that I get it faster".

Additionally, by explicitly introducing filters as a framework API concept, 
we would ask the majority of framework authors to reason about an aspect they 
didn't have to worry about until now (previously: "if work arrives, revive, 
and decline until an offer can be accepted, then suppress"). If we provided 
them something which fits their *current mental model* while also giving them 
more control, we would have a higher chance of it being globally useful and 
adopted than if we added an expert-level knob.

> However, as far as performance is concerned, we still need suppress adoption 
> and not just request adoption. Suppress is actually the bigger performance 
> win at the current time, unless we think that frameworks with no work would 
> "effectively suppress" via requests (e.g. "no work? set a 0 request so 
> nothing matches"). Note though, that "effectively suppressing" via requests 
> has the same incentive structure as suppress itself, right?

I was also wondering how what I suggested would fit here, as we have two 
concepts controlling if and which offers a framework gets (a single global 
flag for suppress, and a zoo of many fine-grained filters). Currently we only 
expose `SUPPRESS`, `DECLINE`, and `REVIVE`. It seems that explicitly adding 

Re: New scheduler API proposal: unsuppress and clear_filter

2018-12-04 Thread Benjamin Bannier
Hi Meng,

thanks for the proposal. I agree that the way these two aspects are currently 
entangled is an issue (e.g., for master/allocator performance reasons). At the 
same time, the workflow we currently expect frameworks to follow is 
conceptually not hard to grasp:

(1) If the framework has work, then
    (i) put the framework in the unsuppressed state,
    (ii) decline non-matching offers with a long filter duration.
(2) If an offer matches, accept.
(3) If there is no more work, suppress. GOTO (1).
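
As a rough sketch, that loop expressed against the v0 SchedulerDriver API 
could look as follows (`matches`, `tasksFor`, and `hasPendingWork` are 
stand-ins for framework-specific logic, not Mesos API):

    void MyScheduler::resourceOffers(
        SchedulerDriver* driver,
        const std::vector<Offer>& offers)
    {
      Filters filters;
      filters.set_refuse_seconds(Days(1).secs()); // (1.ii): long duration.

      for (const Offer& offer : offers) {
        if (matches(offer)) {
          driver->launchTasks({offer.id()}, tasksFor(offer)); // (2): accept.
        } else {
          driver->declineOffer(offer.id(), filters); // (1.ii): decline.
        }
      }

      if (!hasPendingWork()) {
        driver->suppressOffers(); // (3): no more work.
      }
    }

    // (1.i): when new work arrives, leave the suppressed state again.
    void MyScheduler::onNewWork(SchedulerDriver* driver)
    {
      driver->reviveOffers();
    }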

Here the framework does not need to track its filters across allocation 
cycles (they are an unexposed implementation detail of the hierarchical 
allocator anyway), which e.g., allows metaschedulers like Marathon or Apache 
Aurora to decouple the scheduling of different workloads. Downsides of this 
interface are that

* there is little incentive for frameworks to use SUPPRESS in addition to 
filters, and
* unsuppression is all-or-nothing, forcing the master to send potentially all 
unused resources to one framework, even if it is only interested in a 
fraction. This can cause, at least temporarily, non-optimal allocation 
behavior.

It seems to me that even though adding UNSUPPRESS and CLEAR_FILTERS would give 
frameworks more control, it would only be a small improvement. The above 
framework workflow would improve slightly if the framework knows that a new 
workload matches a previously running workload (i.e., it can infer that no 
filters for the resources it is interested in are active), so that it can 
issue UNSUPPRESS instead of CLEAR_FILTERS. Incidentally, there seems to be 
little local benefit for frameworks to use these new calls as they'd mostly 
help the master, and I'd imagine we wouldn't want to imply that clearing 
filters would unsuppress the framework. This seems too little to me, and we 
run the danger that frameworks would just always pair UNSUPPRESS and 
CLEAR_FILTERS (or keep using REVIVE) to simplify their workflow. If we modeled 
the interface more along framework needs, there would be clear benefit, which 
would help adoption.

A more interesting call for me would be REQUEST_RESOURCES. It maps very well 
onto framework needs (e.g., “I want to launch a task requiring these 
resources”), and clearly communicates a requirement to the master so that it 
e.g., doesn’t need to remove all filters for a framework. It also seems to fit 
the allocator model pretty well which doesn’t explicitly expose filters. I 
believe implementing it should not be too hard if we'd restrict its semantics 
to only communicate to the master that a framework _is interested in a certain 
resource_ without promising that the framework _will get them in any amount of 
time_ (i.e., no need to rethink DRF fairness semantics in the hierarchical 
allocator). I also feel that if we have REQUEST_RESOURCES we would have some 
freedom to perform further improvements around filters in the master/allocator 
(e.g., filter compactification, working around increasing the default filter 
duration, …).


A possible zeroth implementation of REQUEST_RESOURCES in the hierarchical 
allocator would be to have it remove any filters containing the requested 
resources and likely also unsuppress the framework. A REQUEST_RESOURCES call 
would hold an optional resource and an optional AgentID; the case where both 
are empty would map onto CLEAR_FILTERS.
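
A rough sketch of how that could look in the hierarchical allocator (all 
names below are assumptions for illustration, not an existing Mesos API; 
`removeMatchingFilters` stands in for the actual filter bookkeeping):

    void HierarchicalAllocatorProcess::requestResources(
        const FrameworkID& frameworkId,
        const Option<Resources>& resources,
        const Option<SlaveID>& slaveId)
    {
      Framework& framework = frameworks.at(frameworkId);

      if (resources.isNone() && slaveId.isNone()) {
        // Degenerate case with the same semantics as CLEAR_FILTERS.
        framework.offerFilters.clear();
      } else {
        // Drop only filters holding back (parts of) the requested
        // resources, optionally scoped to a single agent.
        removeMatchingFilters(framework, resources, slaveId);
      }

      // Likely also leave the suppressed state.
      framework.suppressed = false;
    }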


That being said, it might still be useful to expose a low-level knob for 
frameworks in the future, allowing them to explicitly manage their filters.


Cheers,

Benjamin


On Dec 4, 2018, at 5:44 AM, Meng Zhu  wrote:
> 
> See my comments inline.
> 
> On Mon, Dec 3, 2018 at 5:43 PM Vinod Kone  wrote:
> 
>> Thanks Meng for the explanation.
>> 
>> I imagine most frameworks do not remember what stuff they filtered much
>> less figure out how previously filtered stuff  can satisfy new operations.
>> That sounds complicated!
>> 
> 
> Frameworks do not need to remember what filters they currently have. Only
> knowing
> the resource profiles of the current vs. the previous operation would help
> a lot.
> But yeah, even this may be too much complexity.
> 
>> 
>> But I like your example. So a suggestion we could make to frameworks could
>> be to use CLEAR_FILTERS when they have new work, e.g., scale up/down, new
>> app (they might want to use this even if they aren't suppressed!); and to
>> use UNSUPPRESS when they are rescheduling old work?
>> 
> 
> Yeah, these are the general guideline.
> 
> I want to echo and reemphasize that CLEAR_FILTERS is orthogonal to
> suppression.
> Framework should consider clearing filters regardless of suppression.
> 
> Ideally, when there is new different work, old irrelevant filters should be
> cleared. This helps
> framework to get more offers and makes the allocator run faster (filter
> could take up
> bulk of the allocation time when they build up). On the flip side, calling
> CLEAR_FILTERS too often
> might also have performance implications (esp. if the master/allocator
> actors are already 

Re: Getting write access to our GitHub repo

2018-07-23 Thread Benjamin Bannier
Hi Vinod,

We (Jie, James, and me) briefly discussed this topic and some implications 
over Slack:

* I mentioned I was surprised how a vote on _moving the project repo to ASF 
gitbox_ turned into _moving the project repo to Github_.
* Jie mentioned that this would simplify (enable?) how we could close Github 
PRs. He also mentioned infra reliability.
* I mentioned that I believed that while it was in ASF’s interest to support us 
as long as ASF was around, I wasn’t sure the same would hold for Github.
* I wrote that personally I’d prefer improving limitations in our tooling over 
moving to Github.

That said, I'd prefer if we kept an ASF infra repo as source of truth, as 
agreed on in the vote. We should get a clearer understanding of the 
limitations and limits of what ASF can provide before considering GitHub as 
source of truth. I personally do not yet see a true need.


Cheers,

Benjamin


> On Jul 23, 2018, at 8:44 PM, Jie Yu  wrote:
> 
>> 
>> 1) Merge strategy on GH. I think we want to use the "rebase and merge"
>> strategy only (i.e., disable other strategies) to avoid merge commits. This
>> will be in parity with our RB based workflow.
> 
> 
> Sounds good! And we can "ban" the rest in github setting.
> 
> 2) One writable repo. Do we want to keep both github and gitbox repos as
>> writable repos or do we want to make github the only writable repo (and
>> make gitbox a read-only mirror)? One advantage is that this will avoid
>> conflicts (that need to be manually resolved) when people commit to both
>> repos independently and there is slowness in synchronization.
> 
> 
> +1 on making only github writable.
> 
> 3) Our RB server currently points to yet another mirror "
>> git.apache.org/mesos" which has occasionally given us issues when posting
>> reviews due to synchronization issues. Should we move our RB to point to
>> github too?
> 
> 
> +1 on switching to github
> 
> - Jie
> 
> On Mon, Jul 23, 2018 at 10:49 AM, Vinod Kone  wrote:
> 
>> Few things we need to finalize before the gitbox move.
>> 
>> 1) Merge strategy on GH. I think we want to use the "rebase and merge"
>> strategy only (i.e., disable other strategies) to avoid merge commits. This
>> will be in parity with our RB based workflow.
>> 
>> 2) One writable repo. Do we want to keep both github and gitbox repos as
>> writable repos or do we want to make github the only writable repo (and
>> make gitbox a read-only mirror)? One advantage is that this will avoid
>> conflicts (that need to be manually resolved) when people commit to both
>> repos independently and there is slowness in synchronization.
>> 
>> 3) Our RB server currently points to yet another mirror "
>> git.apache.org/mesos" which has occasionally given us issues when posting
>> reviews due to synchronization issues. Should we move our RB to point to
>> github too?
>> 
>> Thanks,
>> 
>> On Sun, Jul 15, 2018 at 9:26 PM Jie Yu  wrote:
>> 
>>> Vinod, can you start a VOTE thread per our discussion during the
>>> committer's meeting.
>>> 
>>> On Sun, Jul 15, 2018 at 1:34 AM, Gastón Kleiman 
>>> wrote:
>>> 
 On Wed, Jun 20, 2018 at 7:59 PM Vinod Kone 
>> wrote:
 
> Hi folks,
> 
> Looks like ASF now supports  giving
>> write
> access to committers for their GitHub mirrors, which means we can
>> merge
 PRs
> directly on GitHub!
> 
 
 +1. Not only does it allow to merge PRs directly on GitHub, but it also
 allows committers to close stale PRs!
 
 -Gastón
 
>>> 
>> 



Re: Build failed in Jenkins: Mesos-Tidybot » -DENABLE_LIBEVENT=OFF -DENABLE_SSL=OFF,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2) #1341

2018-07-23 Thread Benjamin Bannier
> Hmm. Is this new?

This is about a week old. There’s a fix in progress, 
https://reviews.apache.org/r/68001/.

@jpeach @drexin


> On Mon, Jul 23, 2018 at 11:04 AM Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
> 
>> See <
>> https://builds.apache.org/job/Mesos-Tidybot/CMAKE_ARGS=-DENABLE_LIBEVENT=OFF%20-DENABLE_SSL=OFF,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/1341/display/redirect?page=changes
>>> 
>> 
>> Changes:
>> 
>> [vinodkone] Document SUPPRESS HTTP call [MESOS-7211].
>> 
>> --
>> [...truncated 392.74 KB...]
>> /usr/bin/make -f 3rdparty/CMakeFiles/googletest-1.8.0.dir/build.make
>> 3rdparty/CMakeFiles/googletest-1.8.0.dir/depend
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/http_parser-2.6.2.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/libarchive-3.3.2.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/glog-0.3.3.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/boost-1.65.0.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/libev-4.22.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/concurrentqueue-7b69a8f.dir/DependInfo.cmake
>> --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/picojson-1.3.0.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/protobuf-3.5.0.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/elfio-3.2.dir/DependInfo.cmake --color=
>> make[3]: Entering directory '/BUILD'
>> cd /BUILD && /usr/local/bin/cmake -E cmake_depends "Unix Makefiles"
>> /tmp/SRC /tmp/SRC/3rdparty /BUILD /BUILD/3rdparty
>> /BUILD/3rdparty/CMakeFiles/googletest-1.8.0.dir/DependInfo.cmake --color=
>> make[3]: Leaving directory '/BUILD'
>> /usr/bin/make -f 3rdparty/CMakeFiles/libarchive-3.3.2.dir/build.make
>> 3rdparty/CMakeFiles/libarchive-3.3.2.dir/build
>> make[3]: Leaving directory '/BUILD'
>> /usr/bin/make -f 3rdparty/CMakeFiles/libev-4.22.dir/build.make
>> 3rdparty/CMakeFiles/libev-4.22.dir/build
>> make[3]: Leaving directory '/BUILD'
>> /usr/bin/make -f 3rdparty/CMakeFiles/protobuf-3.5.0.dir/build.make
>> 3rdparty/CMakeFiles/protobuf-3.5.0.dir/build
>> make[3]: Leaving directory '/BUILD'
>> make[3]: Leaving directory '/BUILD'
>> make[3]: Leaving directory '/BUILD'
>> /usr/bin/make -f 3rdparty/CMakeFiles/glog-0.3.3.dir/build.make
>> 3rdparty/CMakeFiles/glog-0.3.3.dir/build
>> make[3]: Leaving directory '/BUILD'
>> /usr/bin/make -f 3rdparty/CMakeFiles/elfio-3.2.dir/build.make
>> 3rdparty/CMakeFiles/elfio-3.2.dir/build
>> /usr/bin/make -f 3rdparty/CMakeFiles/http_parser-2.6.2.dir/build.make
>> 3rdparty/CMakeFiles/http_parser-2.6.2.dir/build
>> /usr/bin/make -f 3rdparty/CMakeFiles/boost-1.65.0.dir/build.make
>> 3rdparty/CMakeFiles/boost-1.65.0.dir/build
>> make[3]: Leaving directory '/BUILD'
>> make[3]: Leaving directory '/BUILD'
>> /usr/bin/make -f 3rdparty/CMakeFiles/picojson-1.3.0.dir/build.make
>> 3rdparty/CMakeFiles/picojson-1.3.0.dir/build
>> Scanning dependencies of target concurrentqueue-7b69a8f
>> make[3]: Entering directory '/BUILD'
>> make[3]: Nothing to be done for
>> '3rdparty/CMakeFiles/libarchive-3.3.2.dir/build'.
>> /usr/bin/make -f 3rdparty/CMakeFiles/googletest-1.8.0.dir/build.make
>> 3rdparty/CMakeFiles/googletest-1.8.0.dir/build
>> make[3]: Leaving directory '/BUILD'
>> make[3]: Entering directory '/BUILD'
>> make[3]: Nothing to be done for '3rdparty/CMakeFiles/glog-0.3.3.dir/build'.
>> make[3]: Leaving directory '/BUILD'
>> make[3]: Entering directory '/BUILD'
>> make[3]: Entering directory '/BUILD'
>> make[3]: Nothing to 

Re: mesos git commit: Added mpsc_linked_queue and use it as the concurrent event queue.

2018-07-16 Thread Benjamin Bannier
Hi Dario,

this patch introduced two new clang-tidy warnings. Could we try to get these 
down to zero, even if the code does not look bad?


I already created a patch for the unused lambda capture,

https://reviews.apache.org/r/67927/

While the code does look reasonable, as a somewhat weird exception C++ allows 
referencing some variables without capturing them.


I also looked into the warning on the "excessive padding". Adding some explicit 
padding seems to make clang-tidy content, but I wasn't sure whether we just 
wanted to put `head` and `tail` on separate cache lines, or also cared about 
the padding added after `tail`.

    private:
      std::atomic<Node*> head;

      char padding[128 - sizeof(std::atomic<Node*>)];

      // TODO(drexin): Programmatically get the cache line size.
      alignas(128) Node* tail; // FIXME: IMO no need for `alignas` to
                               // separate `head` and `tail`.
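
For comparison, a sketch of the `alignas`-only variant, assuming we only 
want `head` and `tail` on separate 128-byte cache lines (whether the 
implicit padding after `tail` matters is exactly the open question above):

    #include <atomic>

    class MpscLinkedQueue  // sketch only, not the actual class
    {
    private:
      struct Node;  // assumed node type

      // Aligning both members to 128 bytes puts them on separate cache
      // lines without manual padding arithmetic; the compiler then also
      // pads the object out to a multiple of 128 bytes after `tail`.
      alignas(128) std::atomic<Node*> head;
      alignas(128) Node* tail;
    };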

Could you put up a patch for that? You can run the linter yourself; it is 
`support/mesos-tidy.sh`.


Cheers,

Benjamin


> On Jul 15, 2018, at 7:02 PM, b...@apache.org wrote:
> 
> Repository: mesos
> Updated Branches:
>  refs/heads/master a11a6a3d8 -> b1eafc035
> 
> 
> Added mpsc_linked_queue and use it as the concurrent event queue.
> 
> https://reviews.apache.org/r/62515
> 
> 
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/b1eafc03
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/b1eafc03
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/b1eafc03
> 
> Branch: refs/heads/master
> Commit: b1eafc035426bc39df4dba81c5c46b8b2d970339
> Parents: a11a6a3
> Author: Dario Rexin 
> Authored: Sat Jul 7 13:20:22 2018 -0700
> Committer: Benjamin Hindman 
> Committed: Sun Jul 15 09:55:28 2018 -0700
> 
> --
> 3rdparty/libprocess/Makefile.am |   1 +
> 3rdparty/libprocess/src/event_queue.hpp | 168 ++---
> 3rdparty/libprocess/src/mpsc_linked_queue.hpp   | 179 +++
> 3rdparty/libprocess/src/tests/CMakeLists.txt|   1 +
> 3rdparty/libprocess/src/tests/benchmarks.cpp|  64 ++-
> .../src/tests/mpsc_linked_queue_tests.cpp   | 104 +++
> 6 files changed, 367 insertions(+), 150 deletions(-)
> --
> 
> 
> http://git-wip-us.apache.org/repos/asf/mesos/blob/b1eafc03/3rdparty/libprocess/Makefile.am
> --
> diff --git a/3rdparty/libprocess/Makefile.am b/3rdparty/libprocess/Makefile.am
> index 2d356aa..631491a 100644
> --- a/3rdparty/libprocess/Makefile.am
> +++ b/3rdparty/libprocess/Makefile.am
> @@ -307,6 +307,7 @@ libprocess_tests_SOURCES =
> \
>   src/tests/loop_tests.cpp\
>   src/tests/main.cpp  \
>   src/tests/metrics_tests.cpp \
> +  src/tests/mpsc_linked_queue_tests.cpp  \
>   src/tests/mutex_tests.cpp   \
>   src/tests/owned_tests.cpp   \
>   src/tests/process_tests.cpp \
> 
> http://git-wip-us.apache.org/repos/asf/mesos/blob/b1eafc03/3rdparty/libprocess/src/event_queue.hpp
> --
> diff --git a/3rdparty/libprocess/src/event_queue.hpp 
> b/3rdparty/libprocess/src/event_queue.hpp
> index 21c522d..999d552 100644
> --- a/3rdparty/libprocess/src/event_queue.hpp
> +++ b/3rdparty/libprocess/src/event_queue.hpp
> @@ -17,10 +17,6 @@
> #include 
> #include 
> 
> -#ifdef LOCK_FREE_EVENT_QUEUE
> -#include 
> -#endif // LOCK_FREE_EVENT_QUEUE
> -
> #include 
> #include 
> 
> @@ -28,6 +24,10 @@
> #include 
> #include 
> 
> +#ifdef LOCK_FREE_EVENT_QUEUE
> +#include "mpsc_linked_queue.hpp"
> +#endif // LOCK_FREE_EVENT_QUEUE
> +
> namespace process {
> 
> // A _multiple_ producer (MP) _single_ consumer (SC) event queue for a
> @@ -187,185 +187,55 @@ private:
> #else // LOCK_FREE_EVENT_QUEUE
>   void enqueue(Event* event)
>   {
> -Item item = {sequence.fetch_add(1), event};
> if (comissioned.load()) {
> -  queue.enqueue(std::move(item));
> +  queue.enqueue(event);
> } else {
> -  sequence.fetch_sub(1);
>   delete event;
> }
>   }
> 
>   Event* dequeue()
>   {
> -// NOTE: for performance reasons we don't check `comissioned` here
> -// so it's possible that we'll loop forever if a consumer called
> -// `decomission()` and then subsequently called `dequeue()`.
> -Event* event = nullptr;
> -do {
> -  // Given the nature of the concurrent queue implementation it's
> -  // possible that we'll need to try to dequeue multiple times
> -  // until it returns an event even though we know there is an
> -  // event because 

Re: RFC: update C++ style to require the "override" keyword

2018-07-09 Thread Benjamin Bannier
Hi,

>> Note that since our style guide _is_ the Google style guide plus some
>> additions, we shouldn't need to update anything in our style guide; the
>> Google style guide seems to have started requiring this from February this
>> year and our code base just got out of sync
> 
> I'd prefer to hoist the rationale up to our guide, since the google one is 
> pretty long and I don't expect us to all re-read it regularly :)

I can see that and am not strongly opposed to that approach.

At the same time, that way we'd require contributors to parse _both_ our and 
Google's C++ style guides to understand what style we prefer (plus figuring 
out the delta). Since in this case we should be able to perform automated 
checks, it should be possible to educate contributors about this requirement 
even without explicitly calling it out ourselves, so we could avoid putting 
more reading between them and their contribution. I'll put up an RR so we 
mention both `support/mesos-style.py` and `support/mesos-tidy.sh` in the 
style guide.


> look for a review request in the near future

Sure! https://silverinahaystack.files.wordpress.com/2016/10/duck-and-cover-2.png


Cheers,

Benjamin

Re: RFC: update C++ style to require the "override" keyword

2018-07-08 Thread Benjamin Bannier
Hi James,

> I’d like to propose that we update our style to require that the
> “override” keyword always be used when overriding virtual functions
> (including destructors). The proposed text is below. I’ll also prepare
> a clang-tidy patch to update stout, libprocess and mesos globally.

+1!

Thanks for bringing this up and offering to do the clean-up. Using `override`
consistently would really give us some certainty as interface methods evolve.
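
To illustrate with a hedged, made-up interface (not actual Mesos code): 
without `override` a changed base class signature silently turns an intended 
override into an unrelated function, while with it the compiler rejects the 
mismatch.

    struct Isolator  // hypothetical interface
    {
      virtual ~Isolator() = default;
      virtual int prepare(int containerId) = 0;
    };

    struct PosixIsolator : Isolator
    {
      ~PosixIsolator() override = default;

      // If the base signature ever changes (say, to `long containerId`),
      // this now fails to compile instead of silently not overriding.
      int prepare(int containerId) override { return containerId; }
    };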

* * *

Note that since our style guide _is_ the Google style guide plus some
additions, we shouldn't need to update anything in our style guide; the Google
style guide seems to have started requiring this from February this year and
our code base just got out of sync.

I believe we should activate the matching warning in our `cpplint` setup,

--- a/support/mesos-style.py
+++ b/support/mesos-style.py
@@ -256,6 +256,7 @@ class CppLinter(LinterBase):
 'build/endif_comment',
 'build/nullptr',
 'readability/todo',
+'readability/inheritance',
 'readability/namespace',
 'runtime/vlog',
 'whitespace/blank_line',


While e.g., `clang` already emits a diagnostic for hidden `virtual` functions,
we might still want to update our `clang-tidy` setup. There is a dedicated
linter for `override` which we might not need due to the default diagnostic,

--- a/support/clang-tidy
+++ b/support/clang-tidy
@@ -25,6 +25,7 @@ google-*,\
 mesos-*,\
 \
 misc-use-after-move,\
+modernize-use-override,\
 \
 readability-redundant-string-cstr\
 "

but it probably makes a lot of sense to check what other compile-time Mesos
features can be enabled by default in our `clang-tidy` setup (either in Jenkins
via `CMAKE_ARGS`, or even better globally by default in
`support/mesos-tidy/entrypoint.sh:31ff`).

I would guess that using `cpplint` to verify automated fixes made with
`clang-tidy` could inform which flags should have been added (there are some
missing features in the cmake build though, e.g., some isolators which would
have benefited from `override` recently).


Cheers,

Benjamin



Re: Should we remove `noexcept` from `ObjectApprover::approved()` signature?

2018-06-14 Thread Benjamin Bannier
Hi,

I still believe that declaring methods of this module interface `noexcept` is 
a good thing, which IMO we should also do for all new module interfaces going 
forward. We do not perform any exception handling around calls to these 
functions in Mesos, and `noexcept` is intended to communicate exactly that.


Googletest (which has in the meantime absorbed googlemock) is preparing their 
1.9.0 release with C++11 support, and it is expected to be released “soon”, so 
I am not sure that breaking backwards compatibility to end up with weaker 
interfaces will be that worthwhile in the long run. I left a sketch for a 
possible workaround in the issue you created, 
https://issues.apache.org/jira/browse/MESOS-8991?focusedCommentId=16512313.
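
For the time being, one workaround along these lines (a hedged sketch, not 
necessarily the one referenced above) is to keep the `noexcept` override 
trivial and delegate to a plain method which gmock can mock:

    class MockObjectApprover : public ObjectApprover
    {
    public:
      Try<bool> approved(
          const Option<Object>& object) const noexcept override
      {
        return approved_(object);  // Forward to the mockable delegate.
      }

      MOCK_CONST_METHOD1(approved_, Try<bool>(const Option<Object>&));
    };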

Let’s continue this discussion on the dev mailing list.


Cheers,


Benjamin


> On Jun 14, 2018, at 12:47 PM, Benno Evers  wrote:
> 
> Hi Alexander,
> 
> > and it is compiled without exception support by default.
> 
> What exactly do you mean by "without support"? My local libmesos.so includes 
> 500 KiB of unwind tables, and we had issues like MESOS-8417 that are caused 
> by unexpected exceptions being thrown.
> 
> On Thu, Jun 14, 2018 at 12:10 PM, Alexander Rojas  
> wrote:
> I may have brought up this issue in the past, however I’m bringing it again, 
> The `ObjectApprover::approved()` [1] method has the following signature:
> 
> ```
> virtual Try<bool> approved(
>   const Option<Object>& object) const noexcept = 0;
> ```
> 
> This is unfortunate since it is impossible to mock a function in google mock 
> with two qualifiers [2] without some modifications to gmock itself. this 
> reduces the amount of tests we can perform.
> 
> Moreover, the `noexcept` qualifier is not even needed in Mesos, since it does 
> not use exceptions and it is compiled without exception support by default.
> 
> The tricky situation here is that this is a public API so it would be tricky 
> to replace since it will break backwards compatibility. So I’m calling out to 
> any modules developer to notify if they are ok with the change or if we 
> should instead try to modify gmock.
> 
> [1] 
> https://github.com/apache/mesos/blob/8b93fa3/include/mesos/authorizer/authorizer.hpp#L221
> [2] https://groups.google.com/forum/#!topic/googlemock/LsbubY26qx4
> 
> 
> 
> Alexander Rojas
> alexan...@mesosphere.io
> 
> 
> 
> 
> 
> 
> 
> -- 
> Benno Evers
> Software Engineer, Mesosphere



Re: [Performance WG] Notes from meeting today

2018-05-16 Thread Benjamin Bannier
Hi Ben,

thanks for taking the time to edit and share these detailed notes. Being able
to see the great work folks are doing surfaced asynchronously is great,
especially when it is put into context as thoughtfully as it is here.


Benjamin

> On May 16, 2018, at 8:06 PM, Benjamin Mahler  wrote:
> 
> Hi folks,
> 
> Here are some notes from the performance meeting today.
> 
> (1) First I did a demo of flamescope, you can find it here:
> https://github.com/Netflix/flamescope
> 
> It's a very useful tool, hopefully we can make it easier for users to
> generate the data that we can drop into flamescope when reporting any
> performance issues. One of the open questions is how `perf --call-graph
> dwarf` compares to `perf -g` but with mesos compiled with frame pointers. I
> haven't had time to check this yet.
> 
> When playing with the tool, it was easy to find some hot spots in the given
> cluster I was looking at (which was not necessarily representative). For
> the agent, jie filed:
> 
> https://issues.apache.org/jira/browse/MESOS-8901
> 
> And for the master, I noticed that metrics, state json generation (no
> surprise), and a particular spot in the allocator were very expensive.
> 
> Metrics we'd like to address via migration to push gauges (Zhitao has
> offered to help with this effort):
> 
> https://issues.apache.org/jira/browse/MESOS-8914
> 
> The state generation we'd like to address via streaming state into a
> separate actor (and providing filtering as well), this will get further
> investigated / prioritized very soon:
> 
> https://issues.apache.org/jira/browse/MESOS-8345
> 
> (2) Kapil discussed benchmarks for the long standing "offer starvation"
> issue:
> 
> https://issues.apache.org/jira/browse/MESOS-3202
> 
> I'll send out an email or document soon with some background on this issue
> as well as our options to address it.
> 
> Let me know if you have any questions or feedback!
> 
> Ben



Re: Introducing `support/mesos-build.sh`

2018-03-20 Thread Benjamin Bannier
Done.

> On Mar 20, 2018, at 10:46 AM, Tomek Janiszewski  wrote:
> 
> Thanks for merging. Can we push this image to dockerhub tagged as
> ubuntu-16.04-arm? Then we will need to update CI configuration to:
> JOBS=16 OS=ubuntu-16.04-arm BUILDTOOL=cmake COMPILER=gcc
> CONFIGURATION='--disable-java --disable-python --disable-libtool-wrappers'
> ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' ../mesos-build.sh
> or
> JOBS=16 OS=ubuntu-16.04-arm BUILDTOOL=autotools COMPILER=gcc
> CONFIGURATION='--disable-java --disable-python --disable-libtool-wrappers'
> ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' ./support/mesos-build.sh
> 
> 
> pon., 19 mar 2018 o 16:24 użytkownik Tomek Janiszewski 
> napisał:
> 
>> Dockerfile for ARM with CMake support https://reviews.apache.org/r/66138/
>> 
>> pon., 12 lut 2018 o 15:41 użytkownik Tomek Janiszewski 
>> napisał:
>> 
>>> How can I use it for our ARM CI?
>>> 
>>> JOBS=16 OS=arm64v8/ubuntu:16.04  ./support/mesos-build.sh
>>> + set -e
>>> + set -o pipefail
>>> + : arm64v8/ubuntu:16.04
>>> + : autotools
>>> + : gcc
>>> + : '--verbose --disable-libtool-wrappers'
>>> + : 'GLOG_v=1 MESOS_VERBOSE=1'
>>> + : 16
>>> ++ git rev-parse --show-toplevel
>>> + MESOS_DIR=/home/janisz/mesos
>>> ++ git diff-index --quiet HEAD --
>>> + docker run --rm -v /home/janisz/mesos:/SRC:Z -e BUILDTOOL=autotools -e
>>> COMPILER=gcc -e 'CONFIGURATION=--verbose --disable-libtool-wrappers' -e
>>> 'ENVIRONMENT=GLOG_v=1 MESOS_VERBOSE=1' -e JOBS=16
>>> mesos/mesos-build:arm64v8/ubuntu-16.04
>>> docker: invalid reference format.
>>> See 'docker run --help'.
>>> 
>>> 
>>> czw., 8 lut 2018 o 07:38 użytkownik Michael Park 
>>> napisał:
>>> 
 The first run looks good!
 https://builds.apache.org/job/Mesos-Buildbot/4890/
 
 [image: Screen Shot 2018-02-07 at 10.30.51 PM.png]
 On Wed, Feb 7, 2018 at 8:39 PM Michael Park  wrote:
 
> Yep, Just landed! Waiting for
 https://builds.apache.org/job/Mesos-Buildbot to
> pick it up.
> 
> On Wed, Feb 7, 2018 at 8:27 PM Vinod Kone 
 wrote:
> 
>> Yay, thanks MPark! Has the change landed already?
>> 
>> On Wed, Feb 7, 2018 at 8:23 PM, Michael Park 
 wrote:
>> 
>>> Many of you probably know that we currently have
>> `support/docker-build.sh`
>>> to power our CI for our various configurations. One of the problems
 for
>> us
>>> has been that we create a `Dockerfile` ad-hoc and invoke `docker
 build`
>>> with it. This is very inefficient and also leads to flaky issues
 around
>>> `apt-get install`.
>>> 
>>> I've introduced `support/mesos-build.sh` which operates off of
 docker
>>> images hosted on Dockerhub instead, and should aid in bringing us
 faster
>>> and more stable CI results!
>>> 
>>> As a bonus, we now also test Clang on the CentOS 7!
>>> 
>>> Thanks,
>>> 
>>> MPark
>>> 
>> 
> 
 
>>> 



API change to augment resource provider information served in master and agent endpoints

2018-03-14 Thread Benjamin Bannier
Hi,

this is a heads up that we would like to augment the master and agent HTTP APIs 
to also serve resource provider resources in `GET_AGENTS` and 
`GET_RESOURCE_PROVIDERS` responses, respectively. This change does not remove or 
change the meaning of existing fields, but is strictly additive.

The relevant diffs are

* master HTTP API: 
https://reviews.apache.org/r/65833/diff/2#collapsed-chunk0.0, 
https://reviews.apache.org/r/65833/diff/2#collapsed-chunk1.0, and
* agent HTTP API: https://reviews.apache.org/r/65832/diff/2#collapsed-chunk0.0, 
https://reviews.apache.org/r/65832/diff/2#collapsed-chunk1.0.

If we do not hear from anyone in the next 3 days we will go ahead and make the 
proposed changes.

Cheers,

Benjamin

Release checksum file distribution change

2018-03-12 Thread Benjamin Bannier
Hi,

this is a heads-up that future Mesos release checksum files will be SHA512,
e.g., `mesos-1.6.0.tar.gz.sha512`. The previously used MD5 checksum files will
not be used anymore for future releases.

Please update any dependent tooling you have on your side accordingly.


Best,

Benjamin


Re: Reconsidering `allocatable` check in the allocator

2018-03-07 Thread Benjamin Bannier
Hi,

> Chatted with BenM offline on this. There's another option what both of us
> agreed that it's probably better than any of the ones mentioned above.
> 
> The idea is to make `allocatable` return the portion of the input resources
> that are allocatable, and strip the unallocatable portion.
> 
> For example:
> 1) If the input resources are "cpus:0.001,gpus:1", the `allocatable` method
> will return "gpus:1".
> 2) If the input resources are "cpus:1,mem:1", the `allocatable` method will
> return "cpus:1".
> 3) If the input resources are "cpus:0.001,mem:1", the `allocatable` method
> will return an empty Resources object.
> 
> Basically, the algorithm is like the following:
> 
> allocatable = input
> foreach known resource type t: do
>  r = resources of type t from the input
>  if r is less than the min resource of type t; then
>    allocatable -= r
>  fi
> done
> return allocatable

I think that sounds like a faithful extension of the current behavior to me 
(removing too-small resources from the offerable pool), but I feel we should 
not just filter out any resource _kind_ below the minimum, but inside a kind 
all _addable_ subresources,

allocatable: Resources = input
for resource: Resource in input:
  if resource < min(resource.kind):
    allocatable -= resource
return allocatable

This would have the effect of clumping together each distinguishable resource 
we care about instead of accumulating, say, different disks which in sum are 
potentially not that much more interesting to frameworks (they would prefer more 
of a particular disk than smaller pieces scattered across multiple disks).
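
A concrete, if simplified, C++ sketch of this per-kind filtering (the resource 
model and minimum values are made up for illustration; this is not the actual 
allocator code):

    #include <map>
    #include <string>

    using Resources = std::map<std::string, double>; // kind -> amount

    // Assumed per-kind minimums (illustrative values only).
    const std::map<std::string, double> minimums = {
      {"cpus", 0.01}, {"mem", 32.0}, {"disk", 32.0}};

    Resources allocatable(const Resources& input)
    {
      Resources result = input;
      for (const auto& [kind, amount] : input) {
        const auto min = minimums.find(kind);
        // Strip each resource falling below its per-kind minimum.
        if (min != minimums.end() && amount < min->second) {
          result.erase(kind);
        }
      }
      return result;
    }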

@alexr
> If we are about to offer some of the resources from a particular agent, why
> would we filter anything at all? I doubt we should be concerned about the
> size of the offer representation travelling through the network. If
> available resources are "cpus:0.001,gpus:1" and we want to allocate GPU,
> what is the benefit of filtering CPU?
> 
> What about the following:
> allocatable(R)
> {
>  return true
>iff (there exists r in R for which size(r) > MIN(type(r)))
> }

I think this is less about communication overhead and more a tool to help make 
sure that offered resources are actually useful to frameworks. If we completely 
removed the current behavior of clumping resources, it might take a long time 
for frameworks to actually receive sufficiently interesting resources. While 
frameworks can use filters to prevent some offers, to filter out an offer we 
currently always require that the filtered resources are a superset of the 
resources we are about to offer. As the number of possible dimensions (e.g., 
resource kinds, labels, other fields) increases it becomes harder and harder 
for filters to be effective in this regard, and the allocator needs to step in.

https://en.wikipedia.org/wiki/Curse_of_dimensionality


Cheers,

Benjamin



Re: [VOTE] C++14 Upgrade

2018-02-12 Thread Benjamin Bannier
+1.

I believe the spreadsheet linked in MESOS-7949 makes it pretty clear that the 
benefits outweigh the required changes to our build requirements.


> On Feb 10, 2018, at 6:28 AM, Michael Park  wrote:
> 
> I'm going to put this up for a vote. My plan is to bump us to C++14 on Feb
> 21.
> 
> The following are the proposed changes:
>  - Minimum GCC *4.8.1* => *5*.
>  - Minimum Clang *3.5* => *3.6*.
>  - Minimum Apple Clang *8* => *9*.
> 
> We'll have a standard voting, at least 3 binding votes, and no -1s.
> 
> Thanks!
> 
> MPark



Re: Please use `int_fd` instead of `int` for file descriptors

2017-12-01 Thread Benjamin Bannier
Hi,

> I'm not sure how to actually help with the issue of making `int_fd`
> more discoverable. The only idea I've got is a ClangTidy check to
> complain about variables of type `int` named `fd` and other similar
> common names for file descriptors such as `socket`.

I was wondering about this as well.

It seems like we already provide a pretty comprehensive set of stout
library functions to create file descriptors. As an example, I see
`net::socket`, so user code directly calling `::socket` does not seem like
something we'd want; we should rather add missing functionality to
our library functions than completely avoid them. If we use wrappers it
should be trivial to catch undesirable use of unwrapped functions given
some list of wrapper functions. We have an existing ticket to create
such a check, https://issues.apache.org/jira/browse/MESOS-5105; please
feel to add interesting wrapper functions to it.

We might also want to consider making `int_fd` a tighter type so that,
e.g., conversions to `int` require explicit user action. That would put
another obstacle in the way of overly careless code.
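
A minimal sketch of what such a tighter type could look like (illustrative 
only, not the actual stout implementation):

    // A file descriptor wrapper where conversion back to a raw `int`
    // requires an explicit cast.
    class int_fd
    {
    public:
      explicit int_fd(int fd) : fd_(fd) {}

      // `int raw = static_cast<int>(fd);` compiles; implicit use as a
      // plain `int` does not.
      explicit operator int() const { return fd_; }

      bool operator==(const int_fd& other) const { return fd_ == other.fd_; }

    private:
      int fd_;
    };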


Cheers,

Benjamin


Re: Differing DRF flavors over roles and frameworks

2017-11-30 Thread Benjamin Bannier
Hi Ben,

and thank you for answering.

> > For frameworks in the same role on the other hand we choose to normalize
> > with the allocated resources
> 
> Within a role, the framework's share is evaluated using the *role*'s total
> allocation as a denominator. Were you referring to the role's total
> allocation when you said "allocated resources"?

Yes.

> I believe this was just to reflect the "total pool" we're sharing within.
> For roles, we're sharing the total cluster as a pool. For frameworks within
> a role, we're sharing the role's total allocation as a pool amongst the
> frameworks. Make sense?

Looking at the allocation loop, I see that while a role sorter uses the
actual cluster resources when generating a sorting, we only seem to
update the total in the picked framework sorter with an `add` at the end
of the allocation loop, so at the very least the "total pool" of
resources in a single role seems to lag. Should this update be moved to
the top of the loop?

> The sort ordering should be the same no matter which denominator you
> choose, since everyone gets the same denominator. i.e. 1,2,3 are ordered
> the same whether you're evaluating their share as 1/10,2/10,3/10 or
> 1/100,2/100,3/100, etc.

This seems to be only true if we have just a single resource kind. For
multiple resource kinds we are not just dealing with a single scale
factor, but will also end up comparing single-resource scales against
each other in DRF.

Here's a brief example of a cluster with two frameworks where we end up
with different DRF weights `f` depending on whether the frameworks are in
the same role or not.

- Setup:
  * cluster total: cpus:40; mem:100; disk:1000
  * cluster used:  cpus:30; mem:  2; disk:   5

  * framework 'a': used=cpus:20; mem:1; disk:1
  * framework 'b': used=cpus:10; mem:1; disk:4

- both frameworks in separate roles
  * framework 'a', role 'A'; role shares: cpus:2/4; mem:1/100; disk:1/1000; 
f=2/4
  * framework 'b', role 'B'; role shares: cpus:1/4; mem:1/100; disk:4/1000; 
f=1/4

- both frameworks in same role:
  * framework 'a': framework shares: cpus:2/3; mem:1/2; disk:1/5; f=2/3
  * framework 'b': framework shares: cpus:1/3; mem:1/2; disk:4/5; f=4/5

If each framework is in its own role we would allocate the next resource
to 'b'; if the frameworks are in the same role we would allocate to 'a'
instead. This is what I meant with

> It appears to me that by normalizing with the used resources inside a role
> we somehow bias allocations inside a role against frameworks with “unusual”
> usage vectors (relative to other frameworks in the same role). 

In this example we would penalize 'b' for having a usage vector very
different from 'a' (here: along the `disk` axis).
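
The numbers above can be reproduced with a small dominant-share calculation (a 
self-contained sketch over a simplified resource model, not the actual sorter 
code):

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>

    using Resources = std::map<std::string, double>; // kind -> amount

    // DRF dominant share: the maximum per-kind share relative to a pool.
    double dominantShare(const Resources& used, const Resources& pool)
    {
      double share = 0.0;
      for (const auto& [kind, amount] : used) {
        share = std::max(share, amount / pool.at(kind));
      }
      return share;
    }

    int main()
    {
      const Resources total = {{"cpus", 40}, {"mem", 100}, {"disk", 1000}};
      const Resources roleUsed = {{"cpus", 30}, {"mem", 2}, {"disk", 5}};
      const Resources a = {{"cpus", 20}, {"mem", 1}, {"disk", 1}};
      const Resources b = {{"cpus", 10}, {"mem", 1}, {"disk", 4}};

      // Separate roles: normalize against the cluster total.
      std::cout << dominantShare(a, total) << " "    // 0.5
                << dominantShare(b, total) << "\n";  // 0.25

      // Same role: normalize against the role's total allocation.
      std::cout << dominantShare(a, roleUsed) << " "   // ~0.667
                << dominantShare(b, roleUsed) << "\n"; // 0.8
    }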


Benjamin


Differing DRF flavors over roles and frameworks

2017-11-29 Thread Benjamin Bannier
Hi,

the DRF flavors we use in our hierarchical allocator slightly differ between 
how we identify the role and the framework most under fair share.

In DRF each actual usage is normalized to some “total”. For roles we use the 
total resources in the cluster (or for quota the total non-revocable 
resources), see e.g., 
https://github.com/apache/mesos/blob/bf507a208da3df360294896f083dd163004324aa/src/master/allocator/mesos/hierarchical.cpp#L521-L524.
 For frameworks in the same role on the other hand we choose to normalize with 
the allocated resources, see e.g., 
https://github.com/apache/mesos/blob/bf507a208da3df360294896f083dd163004324aa/src/master/allocator/mesos/hierarchical.cpp#L551.
 This means that we e.g., need to update the denominators in the framework 
sorters whenever we made an allocation.

These approaches are not equivalent, and I was wondering what the reason for 
the deviating DRF for frameworks was.

It appears to me that by normalizing with the used resources inside a role we 
somehow bias allocations inside a role against frameworks with “unusual” usage 
vectors (relative to other frameworks in the same role). We do not seem to 
document any such intention, and I am also unsure about the usefulness for such 
a bias in a world with large, possibly deeply nested role trees. More from an 
aesthetic point of view, the special treatment for frameworks seems in the same 
role seems to break symmetry (necessitating special treatment), and having to 
update the denominators in framework sorters on each allocations also seems to 
potentially introduce extra churn in framework sorter.

Does anybody recall why we use a different DRF flavor for frameworks in the 
same role as opposed to the one used across roles?


Cheers,

Benjamin

Re: install function

2017-09-14 Thread Benjamin Bannier
Hi,

> I am now reading the Mesos 1.3.0 source code, but I am having trouble with
> Master's initialize() function. Can you help me?
> 
> in Master's initialize() function:
> 
> install<RegisterSlaveMessage>(
>     &Master::registerSlave,
>     &RegisterSlaveMessage::slave,
>     &RegisterSlaveMessage::checkpointed_resources,
>     &RegisterSlaveMessage::version,
>     &RegisterSlaveMessage::agent_capabilities);

It is defined in the header file, e.g.,

  
https://github.com/apache/mesos/blob/38cb694f55ec8ab69efa83e1b958eac071fee2d4/3rdparty/libprocess/include/process/protobuf.hpp#L208-L228


Cheers,

Benjamin

Re: [Design Doc] Native Support for Prometheus Metrics

2017-09-09 Thread Benjamin Bannier
Hi James,

I'd like to make a longer comment here to make it easier to discuss.

> [...]
> 
> Note the proposal to alter how Timer metrics are exposed in an incompatible
> way (I argue this is OK because you can't really make use of these metrics
> now).

I am not sure I follow your argument around `Timer`. It is similar to a gauge
caching the last value, with associated statistics calculated from a time 
series.

I have never used Prometheus, but a brief look at the Prometheus
docs seems to suggest that a `Timer` could be mapped onto a Prometheus summary
type with minimal modifications (namely, by adding a `sum` value that you
propose as sole replacement).
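
For illustration, a summary in the Prometheus text exposition format 
(hypothetical metric name and values) carries quantiles alongside exactly the 
`sum` and `count` you propose to keep:

    allocator_allocation_run_ms{quantile="0.5"} 12.3
    allocator_allocation_run_ms{quantile="0.99"} 45.6
    allocator_allocation_run_ms_sum 3.1e06
    allocator_allocation_run_ms_count 3161331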

I believe that exposing statistics is useful, and moving all `Timer` metrics to
counters (cumulative value and number of samples) would lead to information
loss.

Since most of your criticism of `Timer` is about its associated statistics,
maybe we can make fixes to libprocess' `TimeSeries` and the derived
`Statistics` to make them more usable. Right now `Statistics` seems to be more
apt for dealing with timing measurements where one probably worries more about
the long tail of the distribution (it only exposes the median and higher
percentiles). It seems that if one would e.g., make the exposed percentiles
configurable, it should be possible to expose a useful characterization of the
underlying distribution (think: box plot). It might be that one would need to
revisit how `TimeSeries` sparsifies older data to make sure the quantiles we
expose are meaningful.

> First, note that the “allocator/mesos/allocation_run_ms/count” sample is not
> useful at all. It has the semantics of a saturating counter that saturates at
> the size of the bounded time series. To address this, there is another metric
> “allocator/mesos/allocation_runs”, which tracks the actual count of
> allocation runs (3161331.00 in this case). If you plot this counter over time
> (ie. as a rate), it will be zero for all time once it reaches saturation. In
> the case of allocation runs, this is almost all the time, since 1000
> allocations will be performed within a few hours.

While `count` is not a useful measure of the behavior of the measured datum, it
is critical to assess whether the derived statistic is meaningful (sample
size). Like you write, it becomes less interesting once enough data was
collected.

> Finally, while the derived statistics metrics can be informative, they are
> actually less expressive than a raw histogram would be. A raw histogram of
> timed values would allow an observer to distinguish cases where there are
> clear performance bands (e.g. when allocation times cluster at either 15ms or
> 200ms), but the percentile statistics obscure this information.

I would argue that is more a problem of `Statistics` only reporting percentiles
from the far-out, large-value tail. If, e.g., the reported percentiles were
placed more evenly, it should be possible to recognize bands. After all,
percentiles are just samples from the cumulative distribution, from which one
can derive the underlying distribution (with some resolution) by taking a
derivative.
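
Spelled out: if Q(p) denotes the reported p-quantile, then F(Q(p)) = p for the 
cumulative distribution function F, and the density is f(x) = dF(x)/dx, so 
finely spaced quantiles sample F densely enough to reveal multimodal structure.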

Note that a poorly binned and ranged histogram can obscure features as well.
Reporting percentiles/quantiles has the advantage of adapting to the data
automatically.


Cheers,

Benjamin


Re: Moving the website repo from svn to git

2017-06-01 Thread Benjamin Bannier
Hi Vinod,

> *Implementation details: *
> 
> We have an option to move to
> 1) a standalone git repo (say "mesos-site") which will be mirrored on
> github.
> 2) just use our "mesos" git repo and publish a "asf-site" branch with
> website contents (say at 'site/publish' directory)
> 
> I'm leaning towards 2) because that allows us to deal with single repo
> instead of two.

I have never updated the website so I cannot comment on the pain involved.

As a user of the Mesos source git repository I would however like to bring up 
that _all_ of the website’s assets are generated from files present in the 
source repository (at some point in time). The largest fraction of the 
`publish` directory is Doxygen documentation (currently >90% at ~100 MB). We 
should weigh the effect this would have for developers should we add this 
content to the Mesos source repository.

To get a ballpark idea I imported the website’s history into a git repository. 
After the initial import its `.git` directory contained ~100 MB which went down 
to ~30MB after aggressive repository repacking. A fresh clone of the Mesos 
source repository amounts to ~280 MB, so it seems we would add at least 10% to 
the repository's size with little benefit to developers. Depending on the 
implementation, this number would likely increase would we e.g., provide 
version-dependent website content, or introduce website asset formats not 
compressing as nicely with git (e.g., generated graphics).

I have the feeling keeping this content in a separate repository might strike a 
better balance for developers.


Benjamin



Re: Parallel test runner added

2017-05-02 Thread Benjamin Bannier
Hi again,

I looked at the currently committed parallel test execution tooling and 
summarize existing solutions for machines with many cores below.

I would still be very much interested to know how the existing defaults perform 
for typical dev setups. Every additional data point would be very much 
appreciated.

* * *

Our autotools tooling declares a variable `TEST_DRIVER` which can 
be used to specify an alternative test driver invocation. Assuming one is in a 
directory `build/` directly under the main Mesos checkout, one can invoke

$ ../configure TEST_DRIVER="$PWD/../support/mesos-gtest-runner.py -j10" \
    --enable-parallel-test-execution

to bake a maximal concurrency of 10 into the build setup.

For an already configured setup one would specify flags with

$ make check TEST_DRIVER="$PWD/../support/mesos-gtest-runner.py -j10"

To always have a fixed concurrency one could declare an environment variable 
`TEST_DRIVER` setting a test driver and its args; `./configure` will pick up 
its value and bake it into the build setup, so enabled parallel test execution 
would always use this driver setup.


Under the covers the build setup invokes

% ${TEST_DRIVER} ./src/mesos-tests

i.e. something like,

% ../support/mesos-gtest-runner.py ./src/mesos-tests

so one can experiment with different concurrency levels to decide on an 
acceptable operating point for the concurrency by directly prefixing the test 
invocation with some driver setup. The test runner has help strings documenting 
the understood parameters.


HTH,

Benjamin




> On Apr 29, 2017, at 8:26 AM, Benjamin Bannier 
> <benjamin.bann...@mesosphere.io> wrote:
> 
> Hi Ben,
> 
> I use the parallel runner exclusively on an 8-hyperthread Mac OS machine and
> a 16-core Fedora box. For me only known flaky tests fail.
> 
> Currently the target parallelism is calculated rather naively and can, e.g., 
> grow without bound which will become an issue on machines with many cores. I 
> would also be curious to know how the current defaults perform for "typical" 
> setups. Every additional data point would help us deciding on the best way 
> forward.
> 
> I can take on proposing a patch to improve the situation for machines with 
> many cores after the weekend. 
> 
> 
> Cheers,
> 
> Benjamin 
> 
>> Am 29.04.2017 um 01:12 schrieb Benjamin Mahler <bmah...@apache.org>:
>> 
>> Is anyone using the parallel test runner? I did another test of it today
>> and it triggered 278 failing tests. I noticed a lot of timeouts so I tried
>> bumping the default wait time from 15 seconds to 120 seconds. That brought
>> it down to 43 failures.
>> 
>> Taking a look at the remaining failures, it seems it is going too wide on
>> my system (the system has 12 core, 24 hyperthreads, although I see 48
>> entries in /proc/cpuinfo):
>> 
>> [--] 1 test from DiskQuotaTest
>> [ RUN  ] DiskQuotaTest.SlaveRecovery
>> /home/bmahler/git/mesos/build/src/mesos-containerizer: fork: retry:
>> Resource temporarily unavailable
>> terminate called after throwing an instance of 'std::system_error'
>> what():  Resource temporarily unavailable
>> ../../src/tests/disk_quota_tests.cpp:666: Failure
>> Value of: status->state()
>> Actual: TASK_FAILED
>> Expected: TASK_RUNNING
>> ../../src/tests/disk_quota_tests.cpp:671: Failure
>> Value of: containers->size()
>> Actual: 0
>> Expected: 1u
>> Which is: 1
>> [  FAILED  ] DiskQuotaTest.SlaveRecovery (1636 ms)
>> [--] 1 test from DiskQuotaTest (1638 ms total)
>> 
>> [--] 1 test from FetcherTest
>> [ RUN  ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries
>> ../../src/tests/fetcher_tests.cpp:911: Failure
>> (fetch).failure(): Failed to execute mesos-fetcher: Failed to clone:
>> Resource temporarily unavailable
>> [  FAILED  ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries (8 ms)
>> [--] 1 test from FetcherTest (8 ms total)
>> 
>> It seems we should constrain how wide it goes, as well as restrict the
>> number of worker threads libprocess uses in each instance.
>> 
>>> On Thu, Oct 13, 2016 at 3:51 PM, Michael Park <mp...@apache.org> wrote:
>>> 
>>> Thanks for pushing this through Benjamin!
>>> 
>>> I understand if you're unable to attend the community sync on the 20th,
>>> but would you be able to present this as a demo somehow? maybe via a
>>> screencast?
>>> 
>>> MPark
>>> 
>>> On Thu, Oct 13, 2016 at 6:33 PM, Benjamin Mahler <bmah...@apache.org>
>>> wrote:
>>> 
>>>> Great to see this Benjamin!
>>>> 
>>

Re: Parallel test runner added

2017-04-29 Thread Benjamin Bannier
Hi Ben,

I use the parallel runner exclusively on an 8-hyperthread Mac OS machine and a 
16-core Fedora box. For me only known flaky tests fail.

Currently the target parallelism is calculated rather naively and can, e.g., grow 
without bound which will become an issue on machines with many cores. I would 
also be curious to know how the current defaults perform for "typical" setups. 
Every additional data point would help us deciding on the best way forward.

I can take on proposing a patch to improve the situation for machines with many 
cores after the weekend. 


Cheers,

Benjamin 

> Am 29.04.2017 um 01:12 schrieb Benjamin Mahler <bmah...@apache.org>:
> 
> Is anyone using the parallel test runner? I did another test of it today
> and it triggered 278 failing tests. I noticed a lot of timeouts so I tried
> bumping the default wait time from 15 seconds to 120 seconds. That brought
> it down to 43 failures.
> 
> Taking a look at the remaining failures, it seems it is going too wide on
> my system (the system has 12 core, 24 hyperthreads, although I see 48
> entries in /proc/cpuinfo):
> 
> [--] 1 test from DiskQuotaTest
> [ RUN  ] DiskQuotaTest.SlaveRecovery
> /home/bmahler/git/mesos/build/src/mesos-containerizer: fork: retry:
> Resource temporarily unavailable
> terminate called after throwing an instance of 'std::system_error'
>  what():  Resource temporarily unavailable
> ../../src/tests/disk_quota_tests.cpp:666: Failure
> Value of: status->state()
>  Actual: TASK_FAILED
> Expected: TASK_RUNNING
> ../../src/tests/disk_quota_tests.cpp:671: Failure
> Value of: containers->size()
>  Actual: 0
> Expected: 1u
> Which is: 1
> [  FAILED  ] DiskQuotaTest.SlaveRecovery (1636 ms)
> [--] 1 test from DiskQuotaTest (1638 ms total)
> 
> [--] 1 test from FetcherTest
> [ RUN  ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries
> ../../src/tests/fetcher_tests.cpp:911: Failure
> (fetch).failure(): Failed to execute mesos-fetcher: Failed to clone:
> Resource temporarily unavailable
> [  FAILED  ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries (8 ms)
> [--] 1 test from FetcherTest (8 ms total)
> 
> It seems we should constrain how wide it goes, as well as restrict the
> number of worker threads libprocess uses in each instance.
> 
>> On Thu, Oct 13, 2016 at 3:51 PM, Michael Park <mp...@apache.org> wrote:
>> 
>> Thanks for pushing this through Benjamin!
>> 
>> I understand if you're unable to attend the community sync on the 20th,
>> but would you be able to present this as a demo somehow? maybe via a
>> screencast?
>> 
>> MPark
>> 
>> On Thu, Oct 13, 2016 at 6:33 PM, Benjamin Mahler <bmah...@apache.org>
>> wrote:
>> 
>>> Great to see this Benjamin!
>>> 
>>> Looking forward to seeing the parallel test runner turn green, I'll help
>>> file tickets under the epic (I see there are a lot of test failures for
>>> me).
>>> 
>>> Once we clear the issues and turn it green, shall we make this the
>> default?
>>> I would be in favor of that.
>>> 
>>> Ben
>>> 
>>> On Thu, Oct 13, 2016 at 2:28 PM, Benjamin Bannier <
>>> benjamin.bann...@mesosphere.io> wrote:
>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> Since most tests in the Mesos, libprocess, and stout test suites can
>>>> be executed in parallel (the exception being some `ROOT` tests with
>>>> global side effects in Mesos), we recently added a parallel test
>>>> runner `support/mesos-gtest-runner.py`. This should allow to
>>>> potentially significantly speed up running of test suites.
>>>> 
>>>> To enable automatic parallel execution of tests for test targets
>>>> executed during `make check`, configure Mesos with the option
>>>> `--enable-parallel-test-execution`. This will configure the test
>> runner
>>>> to run all tests but the `ROOT` tests in parallel; `ROOT` tests will
>>>> be run in a separate, sequential step.
>>>> 
>>>> * * *
>>>> 
>>>> We use the environment variable `TEST_DRIVER` to drive parallel test
>>>> execution. By setting this variable to an empty string you can
>>>> temporarily disable configured parallel execution, e.g.,
>>>> 
>>>>% make check TEST_DRIVER=
>>>> 
>>>> By setting this environment variable you have control over the test
>>>> runner itself and its arguments, even without enabling parallel test
>>>> during `./configure` time. Be aware that many 

Re: Exponential Backoff

2017-02-16 Thread Benjamin Bannier
Hi Anindya,

thanks for that nice, systematic write-up. It makes it pretty clear that there 
is some inconsistency in how back-off is handled, and how a more systematic 
approach could help.

I’d like to make a small remark here where I can use some more space than in 
the doc.

>> On Feb 12, 2017, at 9:03 PM, Anindya Sinha  wrote:
>> 
>> Reference: https://issues.apache.org/jira/browse/MESOS-7087 
>> 
>> 
>> Currently, we have at least 3 types of backoff such as:
>> 1) Exponential backoff with randomness, as in framework/agent registration.
>> 2) Exponential backoff with no randomness, as in status updates.
>> 3) Linear backoff with randomness, as in executor registration.

We had a small water cooler discussion about this, and were wondering if it 
would be worthwhile to also take the possibility of globally rate-limiting 
certain request kinds into account, e.g., of framework/agent registration 
requests regardless of the source. This might lead to improvements for any kind 
of activity caused by state changes affecting a large number of agents or 
frameworks. I give a more technical example below.

Also, I believe when evaluating improvements to back-off, it would be a good idea 
to examine the expected time difference between arrivals of messages from 
different actors as a function of the back-off procedure as a benchmark (either 
by checking the theoretical literature or by performing small Monte Carlo 
simulations).


Cheers,

Benjamin


* * * 

# Technical example related to (1) above

Let’s say the following happens:

- A master failover occurs.
- All agents realize this pretty much simultaneously.
- All agents pretty much simultaneously start a registration procedure with the 
new master.

Now if there were no extra randomness introduced into the back-off (but there 
is) the master would see registration attempts from all agents pretty much at 
the same time. In large clusters this could flood the master beyond its 
ability to handle these requests in a timely manner. That we deterministically 
space out registration attempts by larger and larger times wouldn't help the 
master much when it has to deal with massive simultaneous registration load. 
Effectively, the agents might still inadvertently be performing something like 
a coordinated DDoS attack on the master by all retrying after the same time. 
Technically, the underlying issue is that the expected time difference between 
arrival times of registration attempts from different agents at the master 
would still be a Dirac delta function (think: pulse function with zero width 
sitting at zero).

Currently, the only tool protecting the master from having to handle a large 
number of registration attempts is the extra randomness we insert at the sender 
side. We pull this randomness from a uniform distribution. A uniform 
distribution is a great choice here since for a uniform distribution the tails 
of the distribution are as fat as they can get. Fat tails lead to a wider 
arrival time difference distribution at the master (it is a symmetric 
triangular distribution now instead of a delta function, still centered around 
zero though). A wider arrival time distribution means that the the probability 
of registration attempts from different agents arriving close in time is 
lowered; this is great as it potentially gives the master more time to handle 
all the requests.

The remaining issue is that even though we have spaced out requests in time by 
introducing randomness at the source, the most likely time difference between 
arrivals of two messages would still be zero (that’s just a consequence of 
statistics, the distribution for the difference of two independent random 
numbers from the same distribution is symmetric and centered around zero). We 
just have shifted some probability from smaller to larger time differences, but 
for sufficiently large clusters a master might still need to handle many more 
messages than it realistically can. Note that we use randomness at the source 
to space out requests from each other (independent random numbers), and that 
there might be no entity which could coordinate agents to collaboratively space 
out their requests more favorably for the master, e.g., in master failover 
there would be no master to coordinate the agents’ behavior.

I believe one possible solution for this would be to apply back pressure by 
having the master rate-limit messages *before it becomes overloaded* (e.g., 
decided by examining something like the process' message queue size or the 
average time a message stays in the queue, and dropping requests before 
performing any real work on them). This would force clients into another 
back-off iteration which would additionally space out requests.
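
A minimal sketch of that load-shedding idea (illustrative only; the real 
decision would live in libprocess, and the threshold is made up):

    #include <cstddef>
    #include <queue>
    #include <string>

    // Drop messages before doing any real work once the queue exceeds a
    // threshold; dropped senders fall into another back-off iteration,
    // which further spaces out their retries.
    class RateLimitedQueue
    {
    public:
      explicit RateLimitedQueue(std::size_t limit) : limit_(limit) {}

      // Returns false if the message was shed due to back pressure.
      bool offer(std::string message)
      {
        if (queue_.size() >= limit_) {
          return false;
        }
        queue_.push(std::move(message));
        return true;
      }

    private:
      const std::size_t limit_;
      std::queue<std::string> queue_;
    };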

Re: Proposal for Mesos Build Improvements

2017-02-15 Thread Benjamin Bannier
Hi,

> I wonder if we should instead use headers like:
> 
> <- mesos_common.h ->
> #include <a>
> #include <b>
> #include <c>
> 
> <- xyz.cpp, which needs headers "b" and "d" ->
> #include "mesos_common.h"
> 
> #include <b>
> #include <d>
> 
> That way, the fact that "xyz.cpp" logically depends on <b> (but not
> <a> or <c>) is not obscured (in other words, Mesos should continue to
> compile if 'mesos_common.h' is replaced with an empty file).

That’s an interesting angle for a number of reasons. It would allow local 
reasoning about correct includes, and it also appears to help maintain support 
for ccache’d builds,

  https://ccache.samba.org/manual.html#_precompiled_headers

For that one could include project headers such as `mesos_common.h` via a 
command line switch to the compiler invocation, without the need to make any 
changes to source files (possibly an interesting way to create some 
benchmarking POC of this proposal).
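
For example, both GCC and Clang support injecting such a header from the 
command line via the `-include` flag:

    % clang++ -include mesos_common.h -c xyz.cpp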

Not changing source files for this would be valuable as it would keep build 
setup idiosyncrasies out of the source. If we wouldn’t change files we’d keep 
the possibility to make PCH use opt-in. Right now a ccache build of the Mesos 
source files and tests with warm ccache takes less than 50s on my 8 core 
machine (a substantial fraction of this time is spent in serializing 
(non-parallelizable) linking steps, and I’d bet there is also some ~10s 
overhead from Make stat’ing files and changing directories in there).

Generating precompiled headers would throw in an additional serializing step, and 
even if it really only would take 20s to generate a header as guestimated by 
Jeff, we would already be approaching a point of diminishing returns on 
platforms with ccache, even if we compiled every source file in no time.

> Does anyone know whether the header guard in <b> _should_ make the repeated
> inclusion of <b> relatively cheap?

Not sure how much information gcc or clang would need to serialize from the 
PCH, but there is of course some form of multi-include optimization in both gcc 
and clang, see e.g.,

  https://gcc.gnu.org/onlinedocs/cppinternals/Guard-Macros.html


Cheers,

Benjamin

Tracking deprecated features

2017-02-07 Thread Benjamin Bannier
Hi,

we currently track deprecation of features largely through TODOs in the source 
code. Here we typically write down a release at which a deprecated feature 
should be removed.

I believe this is less than optimal since

* it is hard for users of our APIs to track when a deprecated feature is 
actually removed,
* it seems to encourage versioning-related discussions to happen in potentially 
low-visibility review requests instead of JIRA tickets,
* this approach can lead to wrong or misleading information in the code as our 
versioning policies evolve and mature, and
* poor trackability of outstanding deprecations leads to lots of missed 
opportunities to remove features already out of their deprecation cycle as we 
prepare releases.

I would like to propose to use JIRA for tracking deprecations instead.

A possible approach would be:

1) When a feature becomes deprecated, a JIRA ticket is created for its removal. 
The ticket can be referenced in the source code (see the sketch after this list).
2) The ticket should be tagged with e.g. `deprecation`, and optimally link back 
to the ticket triggering the deprecation.
3) A target version is set in collaboration with maintainers of the versioning 
policy.
4) The release process is updated to involve bumping target versions of unfixed 
deprecation tickets to the following version.
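
For illustration, such a source reference could look like this (hypothetical 
developer, flag, and ticket number):

    // TODO(alice): Remove the deprecated `--foo` flag once its deprecation
    // cycle ends; removal is tracked in MESOS-1234.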

I believe with this we would be able to better keep track of and ultimately fix 
tech debt, as well as improve how we communicate breaking changes to users.

Any thoughts?


Cheers,

Benjamin

Re: Order of includes

2016-12-16 Thread Benjamin Bannier
Hi,

> How does putting your own header at the top (vs. ~the bottom) help ensure
> "a header file always includes all symbols it requires”?


Given an incomplete header

// foo.hpp
std::string f();

// foo.cpp
#include "foo.hpp"
#include <string>

std::string f() { return {}; }

I get

% clang++ -fsyntax-only foo.cpp --std=c++11
In file included from foo.cpp:1:
./foo.hpp:1:1: error: use of undeclared identifier 'std'
std::string f();
^
1 error generated.

Swapping the include order makes this pass, as `#include` is just textual 
replacement, and the `#include <string>` in `foo.cpp` would declare the symbol 
used in `foo.hpp`.


Cheers,

Benjamin

Re: Order of includes

2016-12-13 Thread Benjamin Bannier
Hi Yan,

I don’t feel too strongly about most of our style rules regarding include 
ordering since they are just about style.  

> For a cpp file foo.cpp, our style guide instructs folks to put the header
> foo.hpp at the top of the include list:
> https://github.com/apache/mesos/blob/master/docs/c%2B%2B-style-guide.md#order-of-includes
> 
> This is consistent with Google style guide but in reality most of the our
> files follow the rule of "treat foo.hpp the same way as other project
> headers”.

Among all our style rules regarding includes, this one actually does have a 
solid technical justification: It helps ensure that a header file always 
includes all symbols it requires (OK, possibly via discouraged transitive 
includes in the header itself). Not strictly following this rule has led to 
broken header files making their way into the code base (both in the case of 
internal and public headers), see e.g.,

  https://reviews.apache.org/r/54083/
  https://reviews.apache.org/r/54084/

I’d rather have us follow a style that performs some automagic checking of 
header completeness than rely on humans to catch all issues.

Note that including `foo.hpp` first in `foo.cpp` is common practice, and I 
expect following this rule would lead to _less friction_ for newcomers to the 
Mesos code base, see e.g., (no particular order)

  http://llvm.org/docs/CodingStandards.html#include-style
  
https://github.com/bloomberg/bde/wiki/physical-code-organization#component-design-rules
  https://webkit.org/code-style-guidelines/#include-statements
  
https://github.com/facebook/hhvm/blob/master/hphp/doc/coding-conventions.md#what-to-include
  https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes


Cheers,

Benjamin

Re: New Defects reported by Coverity Scan for Mesos

2016-12-06 Thread Benjamin Bannier
Hi,

I filed https://issues.apache.org/jira/browse/MESOS-6726 to address this issue 
in particular and https://issues.apache.org/jira/browse/MESOS-6727 to more 
generally remove the dangerous overload used here.
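
One generic way to address this kind of finding is to give the member a 
definite initial value (an illustrative sketch, not the actual patch):

    // Default-initialize flag members so that no constructor path leaves
    // them in an undefined state.
    struct IOSwitchboardServerFlags
    {
      bool wait_for_connection = false; // Previously left uninitialized.
    };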

HTH,

Benjamin

> On Dec 6, 2016, at 4:40 AM, scan-ad...@coverity.com wrote:
> 
> 
> Hi,
> 
> Please find the latest report on new defect(s) introduced to Mesos found with 
> Coverity Scan.
> 
> 1 new defect(s) introduced to Mesos found with Coverity Scan.
> 2 defect(s), reported by Coverity Scan earlier, were marked fixed in the 
> recent build analyzed by Coverity Scan.
> 
> New defect(s) Reported-by: Coverity Scan
> Showing 1 of 1 defect(s)
> 
> 
> ** CID 1396866:  Uninitialized members  (UNINIT_CTOR)
> /src/slave/containerizer/mesos/io/switchboard.hpp: 216 in 
> mesos::internal::slave::IOSwitchboardServerFlags::IOSwitchboardServerFlags()()
> 
> 
> 
> *** CID 1396866:  Uninitialized members  (UNINIT_CTOR)
> /src/slave/containerizer/mesos/io/switchboard.hpp: 216 in 
> mesos::internal::slave::IOSwitchboardServerFlags::IOSwitchboardServerFlags()()
> 210 "first connection before reading any data from the 
> '*_from_fd's.");
> 211 
> 212 add(&IOSwitchboardServerFlags::socket_path,
> 213 "socket_address",
> 214 "The path of the unix domain socket this\n"
> 215 "io switchboard should attach itself to.");
CID 1396866:  Uninitialized members  (UNINIT_CTOR)
Non-static class member "wait_for_connection" is not initialized in 
 this constructor nor in any functions that it calls.
> 216   }
> 217 
> 218   bool tty;
> 219   int stdin_to_fd;
> 220   int stdout_from_fd;
> 221   int stdout_to_fd;
> 
> 
> 
> To view the defects in Coverity Scan visit, 
> https://u2389337.ct.sendgrid.net/wf/click?upn=08onrYu34A-2BWcWUl-2F-2BfV0V05UPxvVjWch-2Bd2MGckcRZ-2B0hUmbDL5L44V5w491gwGCJEE339V3aTW7x9nwB-2BHtQ-3D-3D_GNnPkJalgkEpe7D7Qaq3CrPne-2BTvAT-2Fi7n61dNNZWw0LT4UjIw54ej3jbmv-2FYiVXDjJUsA9QVMQvV4Sfsby3m0PwzOcH-2BQVR0-2BM9L8SQ2-2ByKpcrAY-2FYrBhypVx90UimTuFH82MOmDuacMPl09f6qGnwqiYMgAuAXQkeP7xe5fFt4FXW-2FNXD9FQr81wjFJweyVjgohEE-2FJHoC5FopCDKpNlr8mzY3OG5TRegXnTnrKag-3D
> 
> To manage Coverity Scan email notifications for 
> "benjamin.bann...@mesosphere.io", click 
> https://u2389337.ct.sendgrid.net/wf/click?upn=08onrYu34A-2BWcWUl-2F-2BfV0V05UPxvVjWch-2Bd2MGckcRbVDbis712qZDP-2FA8y06Nq4GpUVVgAxsS-2B56gradrUgiH-2FS-2F-2BfpPbwW1sDtnGg27oOAryn3RWaGLxkl6Fxas54usbhxEUwvq9bIl7KdDUw8q8aXMKMCI9rIzEGsYnltdyQ-2FMVlHEhp-2BMeSMzsZQajR-2B_GNnPkJalgkEpe7D7Qaq3CrPne-2BTvAT-2Fi7n61dNNZWw0LT4UjIw54ej3jbmv-2FYiVXDjJUsA9QVMQvV4Sfsby3m4wQfYGKIemddc26xkSxwIo0zDWM9yoxoFVxAI7N4qiKkCsTaKeelbOyrafNfR3H7ZoRWhh6ZiiUl-2BeP7SZcJfyU5T2u-2BidZgdJLoy09J0KJDa8krt-2BXszVc9bLeJtXBpq5MYHZjGkso9fDIVZUQEy0-3D
> 



Re: [11/13] mesos git commit: Wired the libprocess code to use the streaming decoder.

2016-11-22 Thread Benjamin Bannier
Hi,

just came across this with our `mesos-this-capture` clang-tidy check:

> +// It is guaranteed that the continuation would run before the next
> +// request arrives. Also, it's fine to pass the `this` pointer to the
> +// continuation as this would get executed synchronously (if still 
> pending)
> +// from `SocketManager::finalize()` due to it closing all active sockets
> +// during libprocess finalization.
> +parse(*request)
> +  .onAny([this, socket, request](const Future& future) {

Even though there is a comment hinting that capturing `this` here should be
safe, I am not sure this is a maintainable solution, e.g., one still working if
we begin to manage the lifetime of `process_manager` instead of simply leaking
a global object. The above code would continue to compile in that case, but
become racy.

Is there some actor we could dispatch to to make this safe, or do we need a new 
abstraction?


Cheers,

Benjamin



Re: [3/3] mesos git commit: Enabled multiple field based authorization in the authorizer interface.

2016-11-17 Thread Benjamin Bannier
Hi,

This introduces a possibly uninitialized member `weight_info` which Coverity 
immediately detected. I filed MESOS-6604 for that. Could you please take that 
on, @Alexander?


Cheers,

Benjamin

> On Nov 16, 2016, at 6:00 PM, m...@apache.org wrote:
> 
> Enabled multiple field based authorization in the authorizer interface.
> 
> Updates the authorizer interfaces as well as the local authorizer,
> such that all actions which were limited to use a _role_ or a
> _principal_ as an object, are able to use whole protobuf messages
> as objects. This change enables more sophisticated authorization
> mechanisms.
> 
> Review: https://reviews.apache.org/r/52600/
> 
> 
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/bc0e6d7b
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/bc0e6d7b
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/bc0e6d7b
> 
> Branch: refs/heads/master
> Commit: bc0e6d7b0b367e5ff67dd5f395e1e06938b02399
> Parents: 40c2e5f
> Author: Alexander Rojas 
> Authored: Tue Nov 15 19:04:25 2016 -0800
> Committer: Adam B 
> Committed: Wed Nov 16 01:55:03 2016 -0800
> 
> --
> include/mesos/authorizer/authorizer.hpp   |   6 +-
> include/mesos/authorizer/authorizer.proto |  54 
> src/authorizer/local/authorizer.cpp   | 115 +
> 3 files changed, 157 insertions(+), 18 deletions(-)
> --
> 
> 
> http://git-wip-us.apache.org/repos/asf/mesos/blob/bc0e6d7b/include/mesos/authorizer/authorizer.hpp
> --
> diff --git a/include/mesos/authorizer/authorizer.hpp 
> b/include/mesos/authorizer/authorizer.hpp
> index cb365c7..7217600 100644
> --- a/include/mesos/authorizer/authorizer.hpp
> +++ b/include/mesos/authorizer/authorizer.hpp
> @@ -61,7 +61,9 @@ public:
>     task_info(object.has_task_info() ? &object.task_info() : nullptr),
>     executor_info(
>         object.has_executor_info() ? &object.executor_info() : nullptr),
> -    quota_info(object.has_quota_info() ? &object.quota_info() : nullptr) {}
> +    quota_info(object.has_quota_info() ? &object.quota_info() : nullptr),
> +    weight_info(object.has_weight_info() ? &object.weight_info() : nullptr),
> +    resource(object.has_resource() ? &object.resource() : nullptr) {}
> 
> const std::string* value;
> const FrameworkInfo* framework_info;
> @@ -69,6 +71,8 @@ public:
> const TaskInfo* task_info;
> const ExecutorInfo* executor_info;
> const quota::QuotaInfo* quota_info;
> +const WeightInfo* weight_info;
> +const Resource* resource;
>   };
> 
>   /**
> 
> http://git-wip-us.apache.org/repos/asf/mesos/blob/bc0e6d7b/include/mesos/authorizer/authorizer.proto
> --
> diff --git a/include/mesos/authorizer/authorizer.proto 
> b/include/mesos/authorizer/authorizer.proto
> index b6a9f14..0696a62 100644
> --- a/include/mesos/authorizer/authorizer.proto
> +++ b/include/mesos/authorizer/authorizer.proto
> @@ -46,11 +46,17 @@ message Object {
>   optional TaskInfo task_info = 4;
>   optional ExecutorInfo executor_info = 5;
>   optional quota.QuotaInfo quota_info = 6;
> +  optional WeightInfo weight_info = 7;
> +  optional Resource resource = 8;
> }
> 
> 
> // List of authorizable actions supported in Mesos.
> +// NOTE: Values in this enum should be kept in
> +// numerical order to prevent accidental aliasing.
> enum Action {
> +  option allow_alias = true;
> +
>   // This must be the first enum value in this list, to
>   // ensure that if 'type' is not set, the default value
>   // is UNKNOWN. This enables enum values to be added
> @@ -58,19 +64,67 @@ enum Action {
>   UNKNOWN = 0;
> 
>   // Actions named *_WITH_foo may set a foo in `Object.value`.
> +
> +  // `REGISTER_FRAMEWORK` will have an object with `FrameworkInfo` set.
> +  // The `_WITH_ROLE` alias is deprecated and will be removed after
> +  // Mesos 1.2's deprecation cycle ends. The `value` field will continue
> +  // to be set until that time.
> +  REGISTER_FRAMEWORK = 1;
>   REGISTER_FRAMEWORK_WITH_ROLE = 1;
> 
>   // `RUN_TASK` will have an object with `FrameworkInfo` and `TaskInfo` set.
>   RUN_TASK = 2;
> 
> +  // `TEARDOWN_FRAMEWORK` will have an object with `FrameworkInfo` set.
> +  // The `_WITH_PRINCIPAL` alias is deprecated and will be removed after
> +  // Mesos 1.2's deprecation cycle ends. The `value` field will continue
> +  // to be set until that time.
> +  TEARDOWN_FRAMEWORK = 3;
>   TEARDOWN_FRAMEWORK_WITH_PRINCIPAL = 3;
> +
> +  // `RESERVE_RESOURCES` will have an object with `Resource` set.
> +  // The `_WITH_ROLE` alias is deprecated and will be removed after
> +  // Mesos 1.2's deprecation cycle ends. The `value` field will continue
> +  // to be set until that time.
> +  

Re: Build failed in Jenkins: Mesos » autotools,gcc,--verbose --enable-libevent --enable-ssl,GLOG_v=1 MESOS_VERBOSE=1,ubuntu:14.04,(docker||Hadoop)&&(!ubuntu-us1)&&(!ubuntu-6)&&(!ubuntu-eu2) #2933

2016-11-17 Thread Benjamin Bannier
Hi,

>> What do folks think about removing future timeouts in tests altogether?
>> Instead, we can time the whole suite differently on different CIs?

> Has there been any response from the ASF Infra folks on addressing the
> VM/hardware issues? Seems like it will be difficult to get good signal
> from the ASF CI in the absence of some improvements on the
> infrastructure side.

Alex brings up a valid way to largely decouple us from VM lag problems, which 
mostly arise because we expect actions in tests to finish faster than they 
actually happen. The real, tested code would be much less aggressive in 
interpreting small response lags as fatal errors.

If we set the default timeout for, say, `AWAIT_READY` in our test code to, 
e.g., infinity, slow VMs would be much less of an issue. To not block machines 
indefinitely for broken tests, we should then probably either limit the 
duration of our Jenkins jobs (if ASF doesn't already have that safeguard), or 
maybe even add a limit to our test execution setup itself (e.g., simply with 
`timeout(1)` or equivalents from the outside, or directly inside the harness).
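
For example (illustrative; GNU coreutils' `timeout` kills the command after 
the given duration):

    % timeout 2h ../support/mesos-gtest-runner.py ./src/mesos-tests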

The downside of this is of course that a hanging test (e.g., due to some true 
race) could block execution of all other tests.

Being more patient can be helpful in other environments as well (e.g., 
`valgrind`).


Cheers,

Benjamin

Re: 0.28.3 release dashboard!

2016-11-07 Thread Benjamin Bannier
Hi Joseph and Anand,

> We are planning to cut this patch release within three workdays - that would 
> be around Monday next week. So, if you have any patches that need to get into 
> 0.28.3 make sure that either it is already in the 0.28.x branch or the 
> corresponding ticket has a target version set to 0.28.3.

There are still a number of rather unpleasant issues filed against 0.28 which 
are only fixed in versions > 0.28.3.

  https://issues.apache.org/jira/browse/MESOS-5224
  https://issues.apache.org/jira/browse/MESOS-5685
  https://issues.apache.org/jira/browse/MESOS-5727
  https://issues.apache.org/jira/browse/MESOS-5763
  https://issues.apache.org/jira/browse/MESOS-6391

Maybe it would be worthwhile to backport some of these.

FYI, I used the following query which still required some manual filtering:

project = Mesos AND \
affectedVersion in (0.28, 0.28.0, 0.28.1, 0.28.2, 0.28.3) AND \
(fixVersion not in (0.28.3) OR fixVersion < 0.28.3) AND \
status = Resolved and type = Bug

This might be a worthwhile addition to patch release dashboards (if 
somebody with more JIRA-fu could come up with an actually working query).


Cheers,

Benjamin

Design doc for rlimit support in Mesos

2016-10-14 Thread Benjamin Bannier
Hi,

we are interested in exposing user resource limits (rlimits) to Mesos so 
executors can prepare environments for task with differing limit requirements. 
The design doc can be found here,


https://docs.google.com/document/d/148og6TlknWIG2d-VmyCG01eliiOGhNEc12mG4TWsfHU/edit?usp=sharing

Feedback welcome!


Cheers,

Benjamin

Parallel test runner added

2016-10-13 Thread Benjamin Bannier

Hi,

Since most tests in the Mesos, libprocess, and stout test suites can
be executed in parallel (the exception being some `ROOT` tests with
global side effects in Mesos), we recently added a parallel test
runner `support/mesos-gtest-runner.py`. This should allow to
potentially significantly speed up running of test suites.

To enable automatic parallel execution of tests for test targets
executed during `make check`, configure Mesos with the option
`--enable-parallel-test-execution`. This will configure the test runner
to run all tests but the `ROOT` tests in parallel; `ROOT` tests will
be run in a separate, sequential step.

* * *

We use the environment variable `TEST_DRIVER` to drive parallel test
execution. By setting this variable to an empty string you can
temporarily disable configured parallel execution, e.g.,

% make check TEST_DRIVER=

By setting this environment variable you have control over the test
runner itself and its arguments, even without enabling parallel test
during `./configure` time. Be aware that many `ROOT` tests cannot be
run in parallel.


The current settings oversubscribe the machine by running `#cores*1.5`
parallel jobs. This was driven by the observation that currently our
tests by and large do not make extended use of even a single core.
The number of parallel jobs can be controlled with the `-j` flag of
the test runner.
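
For example, to explicitly request 1.5x the core count on Linux (illustrative; 
`nproc` reports the number of available cores):

    % ../support/mesos-gtest-runner.py -j$(($(nproc) * 3 / 2)) ./src/mesos-tests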

Since making more use of the machine will likely increase machine load
during test execution, running tests in parallel might expose test
flakiness. Tests might also fail to run in parallel if testcases e.g.,
write data to hardcoded locations or use hardcoded ports. Please file
JIRA tickets for such tests if they do not yet exist.


There is still some work needed to improve reporting from parallel
tests. We currently use a very silent mode if tests are running
without failures, and just report the logs of failed jobs in case of
failure. MESOS-6387 sketches out possible future improvements in this
area.


Happy testing,

Benjamin with help from Kevin & Till



Re: Separate Compilation of Tests

2016-09-27 Thread Benjamin Bannier
Hi,

being able to iterate more rapidly on tests sounds great.

I am slightly unsure about the cost of (i) linking even more binaries, and (ii)
the overhead of setting up the test environment for the invocations of test
binaries (I believe this was O(100ms) per `main` at some point).

I believe if one doesn't mind working with uncommitted changes one could
already now get halfway to the spot you desire by removing `SOURCE`
dependencies one doesn't care about from `mesos_tests`. At that point all one
is left with is the test case, the file containing the test `main`, and
infrastructure pieces. Since in the past we weren't super careful about cutting
these parts into components, figuring out the infrastructure code one actually
needs can be a bit tricky (and one would likely err on the side of pulling in
more than needed).

I think a slightly less disruptive and possibly incremental plan would be to
(1) clearly separate on the source level always-required infrastructure
functions and classes from more specific pieces, and (2) possibly moving
related pieces to convenience libraries. With that it should be possible to
quickly define new test programs with slimmed down dependencies while working
on a feature.  If we wanted we would still be able to add these to a single
project-wide binary like we do now when actually committing.

I think all work in this direction would be strictly cleanup, and could be done
without requiring us to change the way we perform tests, but along the way lay
the foundation for e.g., multiple test binaries, or allow us to expose parts of
the test infrastructure to outside users (e.g., tests of modules).


b.



Re: mesos git commit: Updated quota endpoint help.

2016-05-18 Thread Benjamin Bannier
Hi,

the way one currently has to manually regenerate markdown outputs, which should 
then be checked in together (and ideally atomically) with the corresponding 
source changes, seems to be a recurring source of friction.

I understand that being able to e.g., reference the generated markdown outputs 
is useful, but believe the fundamentally right thing to do would be to generate 
the markdown outputs as part of the build and *not check them into source 
control*. If one would need to reference the endpoint help one could e.g., use 
links to https://mesos.apache.org/documentation/latest/endpoints/ and children.

Any reason this isn’t what we are already doing?


Cheers,

Benjamin


> On May 18, 2016, at 11:42 AM, haosdent  wrote:
> 
> Is it possible to show a warning in `./support/mesos-style.py` when commit
> changes contains "src/master/http.cpp" or "src/slave/http.cpp" while
> doesn't contain document changes?
> 
> On Wed, May 18, 2016 at 5:06 PM, Neil Conway  wrote:
> 
>> When modifying the endpoint help text, we should remember to update
>> the generated help files (via support/generate-endpoint-help.py) --
>> the changes to both the input text and generated output files should
>> be included as part of the same commit.
>> 
>> Neil
>> 
>> On Wed, May 18, 2016 at 10:58 AM,   wrote:
>>> Repository: mesos
>>> Updated Branches:
>>>  refs/heads/master a7835f889 -> 9f63d95f3
>>> 
>>> 
>>> Updated quota endpoint help.
>>> 
>>> 
>>> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
>>> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/9f63d95f
>>> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/9f63d95f
>>> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/9f63d95f
>>> 
>>> Branch: refs/heads/master
>>> Commit: 9f63d95f3cac17c94a7aff57980478263c78f6ee
>>> Parents: a7835f8
>>> Author: Adam B 
>>> Authored: Wed May 18 01:56:57 2016 -0700
>>> Committer: Adam B 
>>> Committed: Wed May 18 01:57:52 2016 -0700
>>> 
>>> --
>>> src/master/http.cpp | 11 ---
>>> 1 file changed, 8 insertions(+), 3 deletions(-)
>>> --
>>> 
>>> 
>>> 
>> http://git-wip-us.apache.org/repos/asf/mesos/blob/9f63d95f/src/master/http.cpp
>>> --
>>> diff --git a/src/master/http.cpp b/src/master/http.cpp
>>> index c4ca343..5d73a1d 100644
>>> --- a/src/master/http.cpp
>>> +++ b/src/master/http.cpp
>>> @@ -1286,15 +1286,20 @@ string Master::Http::QUOTA_HELP()
>>> {
>>>   return HELP(
>>> TLDR(
>>> -"Sets quota for a role."),
>>> +"Gets or updates quota for roles."),
>>> DESCRIPTION(
>>> -"Returns 200 OK when the quota has been changed successfully.",
>>> +"Returns 200 OK when the quota was queried or updated
>> successfully.",
>>> "Returns 307 TEMPORARY_REDIRECT redirect to the leading master
>> when",
>>> "current master is not the leader.",
>>> "Returns 503 SERVICE_UNAVAILABLE if the leading master cannot
>> be",
>>> "found.",
>>> +"GET: Returns the currently set quotas as JSON.",
>>> +"",
>>> "POST: Validates the request body as JSON",
>>> -" and sets quota for a role."),
>>> +" and sets quota for a role.",
>>> +"",
>>> +"DELETE: Validates the request body as JSON",
>>> +" and removes quota for a role."),
>>> AUTHENTICATION(true),
>>> AUTHORIZATION(
>>> "Using this endpoint to set a quota for a certain role requires
>> that",
>>> 
>> 
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang



Re: Looking for shepherd (MESOS-4807)

2016-03-15 Thread Benjamin Bannier
Hi Yong,

> I am looking for shepherd to help me on MESOS-4807. 
> 
> https://issues.apache.org/jira/browse/MESOS-4807
> 
> This issue is similar to MESOS-4806 as both of them tries to fixes issues 
> that could parallelize the tests in mesos. Would appreciate if anyone could 
> shepherd to help me on this issue (MESOS-4807).

Thanks for taking on fixing this issue. Joris stepped up to shepherd your 
patch, and it looks like he already committed it.


Cheers,

Benjamin

Re: Request Mesos contributor role

2016-01-14 Thread Benjamin Bannier
Hi,

>> Error:
>> 2016-01-14 09:19:38 URL:https://reviews.apache.org/r/42288/diff/raw/ 
>> [612/612] -> "42288.patch" [1]
>> Total errors found: 0
>> Checking 1 files
>> Error: Commit message summary (the first line) must not exceed 72 characters.
> 
>> my patch first line is:
>> diff --git a/src/slave/containerizer/docker.cpp 
>> b/src/slave/containerizer/docker.cpp
> 
>> how could I to fix this?

This refers to the commit message,

Docker container REST API /monitor/statistics.json output have no timestamp 
field

which is too long (I count 81 chars, but a hard max is put at 72 chars); the 
same automated check also rejects commit summaries not ending in a period `.`. 
Additionally, a human reviewer will likely ask you to use past tense (e.g., 
“Fixed … for …”).

If you rerun `bootstrap` from the project root it should install local git 
hooks so that the same checks are run locally on your machine while you develop.


HTH,

Benjamin

Re: Using dolt instead of libtool when possible

2016-01-08 Thread Benjamin Bannier
Hi,

> On Jan 5, 2016, at 8:08 PM, James Peach <jor...@gmail.com> wrote:
>> On Jan 5, 2016, at 12:59 AM, Benjamin Bannier 
>> <benjamin.bann...@mesosphere.io> wrote:
>> dolt is a replacement for libtool which promises to fix some performance 
>> issues of libtool, many of which have since dolt’s release landed in some 
>> versions of libtool.
> 
> Is dolt still maintained?

No, development has stopped, but we are talking about 180 lines of m4 code 
here, most of which are embedded shell script templates.

I have used dolt without issues for other projects in the past (mostly under 
some Linux), and sent this mail around to find out if it breaks builds for some 
systems we don’t test often.

>> I have made some first measurements of dolt under Debian8 (hardly any 
>> improvement) and OS X 10.10.5 (noticeable speed-up)
> 
> Which version of autoconf did you test on OS X?

This is GNU autoconf-2.69 from homebrew.


Cheers,

Benjamin

Using dolt instead of libtool when possible

2016-01-05 Thread Benjamin Bannier
Hi,

dolt is a replacement for libtool which promises to fix some performance issues 
of libtool, many of which have, since dolt's release, landed in some versions of 
libtool.

I have made some first measurements of dolt under Debian8 (hardly any 
improvement) and OS X 10.10.5 (noticeable speed-up), see

  https://issues.apache.org/jira/browse/MESOS-4271

While dolt should fall back to libtool if incompatibility with the build host is 
detected, it would be great if we could gather some more feedback on this 
change (e.g., does it horribly fail the build on your host, what speedup does 
it provide, …).
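
If you want to share numbers, a simple before/after wall-clock comparison of 
a clean rebuild should be enough, e.g. (assuming an already configured build 
tree; adjust the job count to your machine):

    $ make clean && time make -j8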


Cheers,

Benjamin

How do you use the fetcher cache?

2015-11-19 Thread Benjamin Bannier
Hi,

In mesos-0.23.0 we added support for caching fetched artifacts (as
described by `CommandInfo::URI`).

If caching was enabled for a URI, re-downloading of known artifacts
could be avoided as long as the artifacts were still inside a
slave-internal LRU-style artifact cache.
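
Caching is a per-URI opt-in. As a minimal sketch in C++ against the
generated protobuf API (the URL and flag values here are purely
illustrative):

    #include <mesos/mesos.pb.h>

    // Fetch an artifact through the fetcher cache.
    mesos::CommandInfo command;
    mesos::CommandInfo::URI* uri = command.add_uris();
    uri->set_value("http://example.com/artifact.tar.gz"); // hypothetical URL
    uri->set_cache(true);   // reuse a cached copy if one is still present
    uri->set_extract(true); // unpack the archive into the sandbox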

Currently the caching layer does not support updating of cached entries,
but there exist a number of proposals for how this functionality could
be added. We would be interested in your feedback on

* how you currently use artifact caching, e.g., do you use caching for
  all schemes, or only enable it selectively (for which schemes?), and
* your approaches/hacks to handle artifact updates.

There currently is a patch in review to refetch cached entries if a
remote URI's size or modification time changed since the last fetch, see
https://issues.apache.org/jira/browse/MESOS-3785. Since this approach
handles at least some use cases without requiring any user involvement,
we would be interested in knowing if such an automated approach would be
beneficial in general, or if it would still require workarounds for your
use case.

Alternatively, more explicit approaches could be taken to give users
more control. It is not clear to us at the moment how that could be
implemented for arbitrary remotes (the fetcher is currently also being
refactored into an interface to enable support for more protocols).


Thanks & cheers,

Benjamin


Re: Mesos Style Guideline Adjustments

2015-11-06 Thread Benjamin Bannier
Hi,

just to echo Alexander’s point: for newbies like me, being able to delegate 
formatting decisions to tools as much as possible frees up a lot of mental 
resources for tackling the real issues.


Cheers,

Benjamin

ps. Also looking forward to an updated and expanded clang-format config.
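
For anyone wanting to experiment in the meantime, a minimal .clang-format 
along these lines approximates the limits discussed here (the values are my 
assumptions, not an agreed-upon project style):

    BasedOnStyle: Google
    ColumnLimit: 80
    IndentWidth: 2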

 
> On Nov 6, 2015, at 1:44 PM, Alexander Rojas  wrote:
> 
> I think one of the main reasons we move to having 80 as the limit for both 
> code and comments is the ability it gives us to use tools (e.g. clang-format) 
> to enforce formatting rules, so personally I rather have us putting effort 
> towards that goal. On that note, the developer branch of clang-format allows 
> a much closer formatting options to the ones we use. On OS X it can be 
> installed using `brew install --HEAD clang-format`.
> 
> Right now I’m working on setting the config file to be as close as possible 
> to our style.
> 
>> On 06 Nov 2015, at 10:09, Alex Rukletsov  wrote:
>> 
>> I think jaggedness in the example you provide comes mainly from the fact
>> that the second comment has multiple logical blocks. I have formatted both
>> comments at 70 and at 80, here is the outcome: http://pastebin.com/nRQB0nCD
>> 
>> While the first comment indeed looks better when wrapped at 70, I can't say
>> the same about the second one.
>> 
>> I would say, that the longer a line could be, the less jagged the comment
>> block is. The ratio (`averageWordLength` / `maxLineLength`) approaches 0 as
>> `maxLineLength` approaches infinity, which means wrapping a long word right
>> before the line end should be perceived less jagged : ).
>> 
>> Also, the longer an individual line can be, the less total lines are needed
>> for a comment block, which reduces jaggedness and makes code a little bit
>> more readable.
>> 
>> But my strongest argument is that having a separate soft rule for comments
>> is hard to enforce. I think what we can do is to encourage contributors /
>> committers to wrap comments in the most logical way—like the first comment
>> in the example you provide—even if the line length is not fully utilized.
>> Having said that, I would rather keep a single number: hard limit at 80 for
>> simplicity.
>> 
>> 
>> 
>> On Thu, Nov 5, 2015 at 10:15 PM, Benjamin Mahler 
>> wrote:
>> 
>>> This has come up in a couple of reviews, seems like we should add some soft
>>> guidelines around how to format comments for readability.
>>> 
>>> In particular, the reason that we wrapped at 70 in the past was for
>>> readability, so it would be great to continue doing so as a soft stylistic
>>> rule. The other thing we've been doing for readability is reducing
>>> "jaggedness" (variability in line lengths).
>>> 
>>> It would be great to establish these as soft rules and encourage new
>>> contributors / committers to follow them. Compare these two comments in
>>> Master::updateTask. The first one wraps at 70 and reduces jaggedness, the
>>> second wraps at 80 and is more jagged:
>>> 
>>> https://github.com/apache/mesos/blob/0.25.0/src/master/master.cpp#L6057
>>> https://github.com/apache/mesos/blob/0.25.0/src/master/master.cpp#L6072
>>> 
>>> I can provide more examples to help clarify. If no one objects, I'll follow
>>> up with an update to the style guide. Thoughts appreciated!
>>> 
>>> On Thu, Sep 10, 2015 at 8:59 AM, Bernd Mathiske 
>>> wrote:
>>> 
 +1
> On Sep 10, 2015, at 4:21 PM, tommy xiao  wrote:
> 
> +1
> 
> 2015-09-10 9:44 GMT+08:00 Marco Massenzio :
> 
>> +1
>> 
>> 
>> 
>> 
>> Thanks, Michael!
>> 
>> 
>> 
>> —
>> Sent from my iPhone, which is not as good as you'd hope to fix trypos
>>> n
>> abbrvtn.
>> 
>> On Wed, Sep 9, 2015 at 6:23 PM, Michael Park 
>>> wrote:
>> 
>>> I've removed the 70 column restriction on comments from the style
 guide:
>>> 
>> 
 
>>> https://github.com/apache/mesos/commit/f9c2604ea97b91f8a9ec3b2863317761679b1c86
>>> Also, based on the comments, it seems like we should allow 80 column
>>> comments but omit the sweeping change.
>>> Thanks,
>>> MPark.
>>> On Wed, Aug 12, 2015 at 6:13 PM Marco Massenzio > wrote:
 On Wed, Aug 12, 2015 at 4:09 AM, Bernd Mathiske <
>>> be...@mesosphere.io>
 wrote:
 
> Like BenM,
> 
> +1 on allowing 80 column comments
> 
 +1
 (it really IS annoying having to keep an eye on the bottom column
>> counter
 when typing comments :)
 
 
> -1 on sweeping changes; incremental changes when touching old
 comments
> will do IMHO
> 
> +1 on the -1? :)
 Incremental changes are good and I doubt anyone will be "confused"
>>> by
>> them.
 
 
> Bernd
> 

Re: RFC: license headers interfere with doxygen documentation (MESOS-3581)

2015-10-23 Thread Benjamin Bannier
Hi,

thanks everyone for providing suggestions and feedback.

It seems we reached a consensus to implement option (a):

> (a) change *all* license headers to be wrapped in e.g. `/* .. */`, also 
> update the coding guidelines, or


and to keep improving the documentation in the code to provide more helpful 
content.

We got 2 binding votes (+1 for (a)) from BenM and Joris, as well as 3 
non-binding votes from James, Marco, and myself, with no -1 for (a).

I will now propose an RR implementing the agreed solution.


Thanks again and cheers,

Benjamin 

Re: RFC: license headers interfere with doxygen documentation (MESOS-3581)

2015-10-21 Thread Benjamin Bannier
Hi Joseph,

yes, doing the right thing and having everything documented would make most of 
this cleaner.

There is still an issue with e.g. namespaces (or anything else the particular 
language allows to be reopened and extended later on):

{foo.hpp}
/**
 * Licensed ..
 */

/**
 * Foo is doxygenized!
 */
namespace foo {}

{foo/bar.hpp}
/**
 * Licensed ..
 */

namespace foo {
/**
 * Bar is doxygenized!
 */
struct Bar {};
}

Here the Doxygen documentation for `foo` will contain both the license header 
and the namespace doc, so to prevent implicit inclusion of license headers in 
the generated documentation one still needs to pick either of the original 
options.


Cheers,

Benjamin

  

> On Oct 20, 2015, at 11:49 PM, Joseph Wu <jos...@mesosphere.io> wrote:
> 
> +/- 0 (a) wouldn't hurt, but isn't the best solution.
> 
> 
> I'd vote for adding actual comment blocks to each class.  Doxygen takes the
> comment block immediately preceding the class and uses that as the
> description.  This means a file like this would show up correctly on
> Doxygen:
> 
> /**
> * License ...
> */
> 
> #include <...>
> 
> /**
> * Bar!  <- This is what would show up on Doxygen.
> * A lot of our existing classes don't have a comment block
> * so Doxygen takes the License instead :(
> */
> class Foo {
>  ...
> }
> 
> ~Joseph
> 
> On Tue, Oct 20, 2015 at 2:32 PM, Marco Massenzio <ma...@mesosphere.io>
> wrote:
> 
>> +1
>> (and thanks for flagging this!)
>> 
>> --
>> *Marco Massenzio*
>> Distributed Systems Engineer
>> http://codetrips.com
>> 
>> On Tue, Oct 20, 2015 at 12:14 PM, Joris Van Remoortere <
>> jo...@mesosphere.io>
>> wrote:
>> 
>>> +1 for (a).
>>> 
>>> 
>>> —
>>> *Joris Van Remoortere*
>>> Mesosphere
>>> 
>>> On Tue, Oct 20, 2015 at 3:02 PM, Benjamin Mahler <
>>> benjamin.mah...@gmail.com>
>>> wrote:
>>> 
>>>> +1 for (a), in this case the wide sweep only touches the license
>>> comments,
>>>> so it won't be disruptive to history.
>>>> 
>>>> On Tue, Oct 20, 2015 at 11:59 AM, James Peach <jor...@gmail.com>
>> wrote:
>>>> 
>>>>> 
>>>>>> On Oct 20, 2015, at 8:55 AM, Bernd Mathiske <be...@mesosphere.io>
>>>> wrote:
>>>>>> 
>>>>>> All, is changing every source code file prohibitive or not?
>>>>>> 
>>>>>>> On Oct 20, 2015, at 10:01 AM, Benjamin Bannier <
>>>>> benjamin.bann...@mesosphere.io> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I would like to ask for input on how we plan to fix (both short-
>> and
>>>>> longterm) the interference of the license headers and Doxygen
>>>> documentation
>>>>> (https://issues.apache.org/jira/browse/MESOS-3581).
>>>>>>> 
>>>>>>> Currently, and in line with the respective guidelines, license
>>> blocks
>>>>> are wrapped in Javadoc-style comments which are also used for Doxygen
>>>>> documentation. This leads to Doxygen interpreting license headers as
>>>>> documentation for whatever entity follows them in the code, and
>> heavily
>>>>> clutters the generated documentation (see e.g.
>>>>> http://mesos.apache.org/api/latest/c++/annotated.html). Given that
>>>>> considerable effort is being put into improving the documentation, this
>>>> is unfortunate.
>>>>>>> 
>>>>>>> * * *
>>>>>>> 
>>>>>>> For a TL;DR of the Jira issue, there are two ways to fix this:
>>>>>>> 
>>>>>>> (a) change *all* license headers to be wrapped in e.g. `/* .. */`,
>>>> also
>>>>> update the coding guidelines, or
>>>>>>> (b) perform some preprocessor-like magic in the Doxygen layer.
>>>>>>> 
>>>>>>> Option (a) is very noisy but obvious and stable; option (b) OTOH
>>>>> employs a simple but stupid text replacement under the covers
>> codified
>>> in
>>>>> the Doxygen config; it might produce some artifacts and be surprising
>>>> since
>>>>> the code Doxygen sees will be different from what is in the source.
>>>>>>> 
>>>>>>> I personally believe option (a) is superior for purely technical
>>>> reasons
>>>>> 
>>>>> +1 for (a); there's no value in showing license headers to doxygen or
>>>>> tooling workarounds
>>>>> 
>>>>>>> with option (b) as a possible temporary workaround.
>>>>>>> 
>>>>>>> 
>>>>>>> To make sure that the generated documentation shows actual
>>>>> documentation content in overviews like
>>>>> http://mesos.apache.org/api/latest/c++/annotated.html and elsewhere
>> we
>>>>> should fix this. Please comment in the Jira issue (
>>>>> https://issues.apache.org/jira/browse/MESOS-3581) your input on how
>>> you
>>>>> think this should be fixed (short- and longterm).
>>>>>>> 
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Benjamin
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 



RFC: license headers interfere with doxygen documentation (MESOS-3581)

2015-10-20 Thread Benjamin Bannier
Hi,

I would like to ask for input on how we plan to fix (both short- and 
long-term) the interference of license headers with the Doxygen documentation 
(https://issues.apache.org/jira/browse/MESOS-3581).

Currently, and in line with the respective guidelines, license blocks are 
wrapped in Javadoc-style comments which are also used for Doxygen 
documentation. This leads to Doxygen interpreting license headers as 
documentation for whatever entity follows them in the code, and heavily 
clutters the generated documentation (see e.g. 
http://mesos.apache.org/api/latest/c++/annotated.html). Given that considerable 
effort is being put into improving the documentation, this is unfortunate.

* * *

For a TL;DR of the Jira issue, there are two ways to fix this:

(a) change *all* license headers to be wrapped in e.g. `/* .. */`, also update 
the coding guidelines, or
(b) perform some preprocessor-like magic in the Doxygen layer.

Option (a) is very noisy but obvious and stable; option (b) OTOH employs a 
simple but stupid text replacement under the covers, codified in the Doxygen 
config; it might produce some artifacts and be surprising since the code 
Doxygen sees will be different from what is in the source.
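
For option (b), the "magic" could e.g. be a Doxyfile input filter along these 
lines (an untested sketch, assuming the license block is the first comment in 
each file and opens with `/**` on line 1):

    # Turn the leading Javadoc-style license comment into a plain C comment
    # so Doxygen does not treat it as documentation.
    INPUT_FILTER = "sed '1s;/\*\*;/*;'"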

I personally believe option (a) is superior for purely technical reasons, with 
option (b) as a possible temporary workaround.


To make sure that the generated documentation shows actual documentation 
content in overviews like http://mesos.apache.org/api/latest/c++/annotated.html 
and elsewhere, we should fix this. Please add your input on how you think this 
should be fixed (short- and long-term) to the Jira issue 
(https://issues.apache.org/jira/browse/MESOS-3581).


Cheers,

Benjamin