Re: Subject: [VOTE] Release Apache Mesos 1.10.0 (rc1)

2020-05-27 Thread Greg Mann
+1 (binding)

Observed only known flakes in internal CI.

On Wed, May 27, 2020 at 9:56 AM Benjamin Mahler wrote:

> +1 (binding)
>
> On Mon, May 18, 2020 at 4:36 PM Andrei Sekretenko wrote:
>
>> Hi all,
>>
>> Please vote on releasing the following candidate as Apache Mesos 1.10.0.
>>
>> 1.10.0 includes the following major improvements:
>>
>> 
>> * support for resource bursting (setting task resource limits separately
>> from requests) on Linux
>> * ability for an executor to communicate with an agent via Unix domain
>> socket instead of TCP
>> * ability for operators to modify reservations via the RESERVE_RESOURCES
>> master API call
>> * performance improvements of V1 operator API read-only calls bringing
>> them on par with V0 HTTP endpoints
>> * ability for a scheduler to expect that effects of calls sent through
>> the same connection will not be reordered/interleaved by master
>>
>> NOTE: 1.10.0 includes a breaking change for custom authorizer modules.
>> Now, `ObjectApprover`s may be stored by Mesos indefinitely and must be
>> kept up-to-date by an authorizer throughout their lifetime.
>> This allowed for several bugfixes and performance improvements.
>>
>> The CHANGELOG for the release is available at:
>>
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.10.0-rc1
>>
>> 
>>
>> The candidate for Mesos 1.10.0 release is available at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz
>>
>> The tag to be voted on is 1.10.0-rc1:
>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.10.0-rc1
>>
>> The SHA512 checksum of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.sha512
>>
>> The signature of the tarball can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/mesos/1.10.0-rc1/mesos-1.10.0.tar.gz.asc
>>
>> The PGP key used to sign the release is here:
>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>
>> The JAR is in a staging repository here:
>> https://repository.apache.org/content/repositories/orgapachemesos-1259
>>
>> Please vote on releasing this package as Apache Mesos 1.10.0!
>>
>> The vote is open until Fri, May 21, 19:00 CEST  and passes if a majority
>> of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Mesos 1.10.0
>> [ ] -1 Do not release this package because ...
>>
>> Thanks,
>> Andrei Sekretenko
>>
>


Re: [VOTE] Release Apache Mesos 1.7.3 (rc1)

2020-05-19 Thread Greg Mann
Hi all,

The vote for Mesos 1.7.3 (rc1) has passed with the
following votes.

+1 (Binding)
--
Vinod Kone
Benjamin Mahler
Greg Mann


There were no 0 or -1 votes.
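As a rough illustration of the passing rule quoted in this thread (a majority with at least 3 +1 PMC votes), the tally above can be checked with a few lines of Python. The `Vote` type and `vote_passes` function are hypothetical names for this sketch, and reading the rule as "at least three binding +1 votes, and more binding +1 than binding -1 votes" is an assumption rather than an official definition.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int     # +1, 0, or -1
    binding: bool  # True for PMC members


def vote_passes(votes: list[Vote], minimum_binding: int = 3) -> bool:
    """Assumed rule: enough binding +1 votes, and more binding +1 than binding -1."""
    plus = sum(1 for v in votes if v.binding and v.value == 1)
    minus = sum(1 for v in votes if v.binding and v.value == -1)
    return plus >= minimum_binding and plus > minus


# The three binding +1 votes listed in the result announcement.
votes_1_7_3 = [
    Vote("Vinod Kone", 1, True),
    Vote("Benjamin Mahler", 1, True),
    Vote("Greg Mann", 1, True),
]

print(vote_passes(votes_1_7_3))
```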

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.7.3

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.3

The mesos-1.7.3.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Greg Mann

On Fri, May 8, 2020 at 2:14 PM Greg Mann wrote:

> +1 (binding)
>
> Ran in internal CI, also built manually on CentOS 7. Only known flaky test
> failures observed.
>
> On Thu, May 7, 2020 at 3:02 PM Benjamin Mahler wrote:
>
>> +1 (binding)
>>
>> On Mon, May 4, 2020 at 1:48 PM Greg Mann wrote:
>>
>> > Hi all,
>> >
>> > Please vote on releasing the following candidate as Apache Mesos 1.7.3.
>> >
>> > The CHANGELOG for the release is available at:
>> >
>> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.3-rc1
>> >
>> > The candidate for Mesos 1.7.3 release is available at:
>> >
>> > https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz
>> >
>> > The tag to be voted on is 1.7.3-rc1:
>> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.3-rc1
>> >
>> > The SHA512 checksum of the tarball can be found at:
>> >
>> > https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.sha512
>> >
>> > The signature of the tarball can be found at:
>> >
>> > https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.asc
>> >
>> > The PGP key used to sign the release is here:
>> > https://dist.apache.org/repos/dist/release/mesos/KEYS
>> >
>> > The JAR is in a staging repository here:
>> > https://repository.apache.org/content/repositories/orgapachemesos-1258
>> >
>> > Please vote on releasing this package as Apache Mesos 1.7.3!
>> >
>> > The vote is open until Thu, May 7, 11:00 PDT 2020, and passes if a majority
>> > of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Mesos 1.7.3
>> > [ ] -1 Do not release this package because ...
>> >
>> > Thanks,
>> > Greg Mann
>> >
>>
>


Re: [VOTE] Release Apache Mesos 1.7.3 (rc1)

2020-05-08 Thread Greg Mann
+1 (binding)

Ran in internal CI, also built manually on CentOS 7. Only known flaky test
failures observed.

On Thu, May 7, 2020 at 3:02 PM Benjamin Mahler wrote:

> +1 (binding)
>
> On Mon, May 4, 2020 at 1:48 PM Greg Mann wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.7.3.
> >
> > The CHANGELOG for the release is available at:
> >
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.3-rc1
> >
> > The candidate for Mesos 1.7.3 release is available at:
> >
> > https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz
> >
> > The tag to be voted on is 1.7.3-rc1:
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.3-rc1
> >
> > The SHA512 checksum of the tarball can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.sha512
> >
> > The signature of the tarball can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1258
> >
> > Please vote on releasing this package as Apache Mesos 1.7.3!
> >
> > The vote is open until Thu, May 7, 11:00 PDT 2020, and passes if a majority
> > of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 1.7.3
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> > Greg Mann
> >
>


[VOTE] Release Apache Mesos 1.7.3 (rc1)

2020-05-04 Thread Greg Mann
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.7.3.

The CHANGELOG for the release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.3-rc1


The candidate for Mesos 1.7.3 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz

The tag to be voted on is 1.7.3-rc1:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.3-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.7.3-rc1/mesos-1.7.3.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1258

Please vote on releasing this package as Apache Mesos 1.7.3!

The vote is open until Thu, May 7, 11:00 PDT 2020, and passes if a majority
of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.7.3
[ ] -1 Do not release this package because ...

Thanks,
Greg Mann
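For anyone checking a candidate before voting, the SHA512 comparison can be sketched in Python as below. This is a minimal sketch under stated assumptions: `sha512_of` and `checksum_matches` are hypothetical helper names, both files are assumed to be already downloaded, and the `.sha512` file is assumed to begin with the hex digest as its first whitespace-separated token (other layouts would need different parsing). It does not cover the PGP signature, which needs a separate tool such as GnuPG together with the KEYS file.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(tarball: Path, checksum_file: Path) -> bool:
    """Compare the tarball digest with the first token of the checksum file.

    Assumption: the checksum file starts with the hex digest.
    """
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(tarball) == expected
```

With the files saved locally, a call such as `checksum_matches(Path("mesos-1.7.3.tar.gz"), Path("mesos-1.7.3.tar.gz.sha512"))` would return `True` only when the digests agree.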


Custom Docker executors

2020-04-22 Thread Greg Mann
Hi all,
I'd like to propose that we remove the code in the Docker containerizer
which allows schedulers to run custom executors in Docker containers. I
suspect that this feature may not be used at all currently, and its
presence in the containerizer leads to some added complexity. If you do run
custom executors in Docker containers please let me know, otherwise I will
likely go ahead and submit patches to make this change. If I do so, I'll
reply on this email thread with links to the patches so the community can
have a look.

Cheers,
Greg


RFC: Task Resource Limits Design

2019-11-22 Thread Greg Mann
Hi all,
A few others and I have been working recently on a design to support
task resource limits in Mesos. This feature allows schedulers to specify on
their tasks an upper limit for CPU and memory which is greater than the
requested CPU/mem resources. The task will be guaranteed access to the
requested resources, but it will be permitted to consume up to the limit.
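The request/limit relationship described above can be sketched as a small data structure. This is only an illustration: `ResourceSpec` and its methods are hypothetical names, not the Mesos API or the design in the doc, and treating a missing limit as "limit equals request" is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceSpec:
    request: float                 # guaranteed amount (e.g. CPUs, or MB of memory)
    limit: Optional[float] = None  # upper bound; None is treated as "same as request"

    def validate(self) -> None:
        """Reject specs whose limit is below the guaranteed request."""
        if self.request <= 0:
            raise ValueError("request must be positive")
        if self.limit is not None and self.limit < self.request:
            raise ValueError("limit must not be lower than request")

    def effective_limit(self) -> float:
        return self.request if self.limit is None else self.limit

    def burst_headroom(self) -> float:
        """How far above its guaranteed share the task may consume."""
        return self.effective_limit() - self.request


cpus = ResourceSpec(request=1.0, limit=2.5)
mem_mb = ResourceSpec(request=512, limit=2048)
for spec in (cpus, mem_mb):
    spec.validate()

print(cpus.burst_headroom())    # 1.5
print(mem_mb.burst_headroom())  # 1536
```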

I have a design doc here,
please take a look and comment!

Thanks,
Greg


Re: Converting vm to task (performance degraded)

2019-08-29 Thread Greg Mann
Hi Marc,
Could you clarify the hardware configuration of the host? How many cores
does it have? Is this host the same one that you were running the VM on?
How many cores were allocated to the VM?

Regarding the 'top' output, if you're running 'top' with default settings I
would expect to see the nameserver process utilizing 100% of CPU, since
this would represent 1 core, so that does not match my expectations. Did
you supply any flags when running 'top'?

If this were something like a 6- or 8-core machine, and your previous VM
had been allocated all of the cores, then your numbers might make sense,
since you seem to be achieving approximately one sixth of your previous
throughput?
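The "one sixth" estimate can be checked against the per-second figures reported in the quoted message below (roughly 10k q/s on the VM, 1.8k q/s on the task, and 17% CPU in 'top'). The core counts in the loop are hypothetical, since the host's actual core count is not stated in the thread.

```python
vm_qps = 10_000        # reported VM throughput, queries per second
task_qps = 1_800       # reported task throughput, queries per second
task_cpu_share = 0.17  # CPU share reported by 'top' for the task

throughput_ratio = task_qps / vm_qps
print(f"throughput ratio: {throughput_ratio:.2f} (about 1/{1 / throughput_ratio:.1f})")
print(f"CPU share: {task_cpu_share:.2f} (about 1/{1 / task_cpu_share:.1f})")

# One core expressed as a fraction of the whole host, for a few
# hypothetical core counts.
for cores in (4, 6, 8):
    print(f"{cores} cores: one core is {1 / cores:.1%} of the host")
```

Both ratios come out a little under one sixth, which is the share a single core represents on a 6-core host.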

On Thu, Aug 29, 2019 at 5:59 AM Marc Roos wrote:

>
> I am testing converting a nameserver vm to a task on mesos. If I query
> just one domain (so the result comes from cache) for 30 seconds I can
> do around 450.000 queries on the vm, and only 17.000 on the task.
> When I look at top output on the host where task is running I see this
> task only using 17% cpu time (vm allocates 100% cpu). I have launched
> the task with cpus: 1
>
> How/where/what should I check that causes this reduced performance?  I
> think some configuration is limiting because I can easily get 10k q/s on
> the vm and the task is only getting 1,8k q/s
>
> Is there a configuration guide on how to change a host's settings to
> optimize it for use with Mesos?
>


Re: [VOTE] Release Apache Mesos 1.8.1 (rc1)

2019-07-17 Thread Greg Mann
+1

'sudo make check' on CentOS

On Tue, Jul 16, 2019 at 4:56 PM Meng Zhu wrote:

> +1
>
> tested on centos 7.4, only known flakies:
>
> [  PASSED  ] 466 tests.
> [  FAILED  ] 7 tests, listed below:
> [  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
> [  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
> [  FAILED  ] DockerVolumeIsolatorTest.ROOT_CommandTaskNoRootfsWithVolumes
> [  FAILED  ] DockerVolumeIsolatorTest.ROOT_CommandTaskNoRootfsSlaveRecovery
> [  FAILED  ] DockerVolumeIsolatorTest.ROOT_EmptyCheckpointFileSlaveRecovery
> [  FAILED  ]
> DockerVolumeIsolatorTest.ROOT_CommandTaskNoRootfsSingleVolumeMultipleContainers
> [  FAILED  ]
> NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_TensorflowGpuImage
>
> -Meng
>
> On Wed, Jul 10, 2019 at 1:48 PM Vinod Kone wrote:
>
>> +1 (binding).
>>
>> Tested in ASF CI. One build failed due to known flaky test
>> https://issues.apache.org/jira/browse/MESOS-9594
>>
>>
>> *Revision*: 4ae06448466408d9ec96ede953208057609f0744
>>
>>- refs/tags/1.8.1-rc1
>>
>> Configuration Matrix gcc clang
>> centos:7 --verbose --disable-libtool-wrappers
>> --disable-parallel-test-execution --enable-libevent --enable-ssl autotools
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> [image: Not run]
>> cmake
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> [image: Not run]
>> --verbose --disable-libtool-wrappers --disable-parallel-test-execution
>> autotools
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> [image: Not run]
>> cmake
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> [image: Not run]
>> ubuntu:16.04 --verbose --disable-libtool-wrappers
>> --disable-parallel-test-execution --enable-libevent --enable-ssl autotools
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A16.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A16.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> cmake
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A16.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> [image: Failed]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=cmake,COMPILER=clang,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A16.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> --verbose --disable-libtool-wrappers --disable-parallel-test-execution
>> autotools
>> [image: Success]
>> <
>> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/71/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A16.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/
>> >
>> [image: Success]
>> <
>> 

[API WG] Meeting today cancelled

2019-07-09 Thread Greg Mann
Hi all,
We don't have anything on the agenda for today, so I'm going to cancel
today's meeting. Please add any items you might have for discussion to the
agenda here if you'd like to discuss them next time!

Cheers,
Greg


Re: Design doc: Agent draining and deprecation of maintenance primitives

2019-06-14 Thread Greg Mann
or and we could easily patch it to have a custom sort
> mechanism. But I completely agree that optimistic offers or similar
> techniques are the way to go.
> >
> > I don't think that we will ever get to the point of having a reference
> scheduler, the Mesos community would need to agree on one implementation
> and make sure that every new feature of Mesos gets implemented in the
> scheduler. This is a huge amount of work and coordination/design. The
> mesosphere dcos-commons library is one example of the complexity of such a
> project, it is dedicated to stateful services, is clearly coupled with
> DC/OS (although we are able to use it on bare Mesos too), and it's still
> difficult to use. However, having an open source scheduler exposing a
> higher-level friendly API via RPC (like kubernetes for example), is
> probably the only way to make Mesos more accessible for most users.
> >
> > On Fri, Jun 7, 2019 at 6:24 AM Benjamin Mahler wrote:
> > > With the new proposal, it's going to be as difficult as before to have
> SLA-aware maintenances because it will need cooperation from the frameworks
> anyway and we know this is rarely a priority for them. We will also lose
> the ability to signal future maintenance in order to optimize allocations.
> >
> > Personally, I think right now we should solve the basic need of draining
> a node. The plan to add SLA-awareness into draining was to introduce a
> capability that schedulers opt into that enables them to (1) take control
> over the killing of tasks when an agent is put into the draining state and
> (2) still get offers when an agent is in the draining state in case the
> scheduler needs to restart a task that *must* run. This allows an SLA-aware
> scheduler to avoid killing during a drain if its task(s) will have SLAs
> violated.
> >
> > Perhaps this functionality can live alongside the maintenance schedule
> information we currently support, without being coupled together. As far as
> I'm aware that's something we hadn't considered (we considered integrating
> into the maintenance schedules or replacing them).
> >
> > > For example I had this idea to improve the allocator (or write a
> custom one) that would offer resources from agents with no maintenance
> planned in priority, and then sort agents by maintenance date in
> decreasing order.
> >
> > Right now there is no meaning to the order of offers. Adding some
> meaning to the ordering of offers quickly becomes an issue for us as soon
> as there are multiple criteria that need to be evaluated. For example, if
> you want to incorporate maintenance, load spreading, fault domain
> spreading, etc across machines, it becomes less clear how offers should be
> ordered. One could try to build some scoring model in mesos for ordering,
> but it will be woefully inadequate since Mesos does not know anything about
> the pending workloads: it's ultimately the schedulers that are best
> positioned to make these decisions. This is why we are going to move
> towards an "optimistic concurrency" model where schedulers can choose what
> they want and Mesos enforces constraints (e.g. quota limits), thereby
> eliminating the multi-scheduler scalability issues of the current offer
> model.
> >
> > And as somewhat of an aside, the lack of built-in scheduling has been
> bad for the Mesos ecosystem. The vast majority of users just need to
> schedule: services, jobs and cron jobs. These have a pretty standard look
> and feel (including the SLA aspect of them!). Many of the existing
> schedulers could be thinner "orchestrators" that know when to submit
> something to be scheduled by a common scheduler, rather than reimplementing
> all of the typical scheduling primitives (constraints, SLA awareness,
> dealing with the low level mesos scheduling API). My point here is that we
> ask too much of frameworks and it hurts users. I would love to see
> scheduling become more standardized and built into Mesos.
> >
> > On Thu, Jun 6, 2019 at 10:52 AM Greg Mann wrote:
> > Maxime,
> > Thanks for the feedback, it's much appreciated. I agree that it would be
> possible to evolve the existing primitives to accomplish something similar
> to the proposal. That is one option that was considered before writing the
> design doc, but after some discussion, I thought that it seems more
> appropriate to start over with a simpler model that accomplishes what we
> perceive to be the predominant use case: the automated draining of agent
> nodes, without the concept of a maintenance window or designated
> maintenance time in the future. However, perhaps this perception is
> incorrect?
> >
> > Using maintenance metadata to alter the sorting

Re: Design doc: Agent draining and deprecation of maintenance primitives

2019-06-14 Thread Greg Mann
Hi all,
A few other committers and I spent some time revisiting the
possibility of implementing agent draining using maintenance windows, as
well as discussing the coexistence of the existing maintenance primitives
with the agent draining feature as it is currently designed. Ultimately,
the use case of an operator putting an agent into a draining state
immediately and indefinitely, with no concept of a maintenance window,
seems to be valid. That use case is a bit awkward to represent in terms of
our existing maintenance windows. So, our thought is that we can add the
agent draining feature as it is currently designed, in order to provide an
automatic agent draining primitive. We can then later on extend the
maintenance schedules to allow operators to specify that they would like to
automatically drain agents leading up to the maintenance window. At that
point, we could make use of the agent draining primitive to accomplish this.

For the time being, we would like to disallow any single agent from both
being present in the maintenance schedule and being put into an automatic
draining state. This gives us some time to figure out precisely how these
two features will interact so that we avoid the need to make breaking
changes down the road.
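The temporary restriction described above (an agent may be in the maintenance schedule or in the automatic draining state, but not both) amounts to a mutual-exclusion check. The sketch below is illustrative only: `AgentStateRegistry`, its methods, and `AgentStateError` are hypothetical names and do not reflect how the Mesos master actually stores or validates this state.

```python
class AgentStateError(Exception):
    """Raised when an agent would be both scheduled for maintenance and draining."""


class AgentStateRegistry:
    def __init__(self):
        self._scheduled = set()  # agent IDs present in the maintenance schedule
        self._draining = set()   # agent IDs put into automatic draining

    def schedule_maintenance(self, agent_id: str) -> None:
        if agent_id in self._draining:
            raise AgentStateError(f"{agent_id} is draining; cannot add it to the schedule")
        self._scheduled.add(agent_id)

    def start_draining(self, agent_id: str) -> None:
        if agent_id in self._scheduled:
            raise AgentStateError(f"{agent_id} is in the maintenance schedule; cannot drain it")
        self._draining.add(agent_id)


registry = AgentStateRegistry()
registry.schedule_maintenance("agent-1")
registry.start_draining("agent-2")

try:
    registry.start_draining("agent-1")
except AgentStateError as error:
    print(error)
```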

Let me know what you all think of the above plan. I like it because it
allows operators who are currently using the maintenance primitives to
continue doing so, accommodates the simple case of immediate agent draining
in the near future, and allows us to incorporate automatic draining into
the maintenance schedule later.

Cheers,
Greg

On Fri, Jun 14, 2019 at 4:18 PM Greg Mann wrote:

> Christoph,
> Great to hear that you're using the maintenance primitives! It seems
> unwise for us to deprecate this part of the API given the fact that you and
> Maxime have both expressed a desire for it to stick around. I'll adjust the
> agent draining design doc to remove the deprecation of that feature. Many
> thanks for your feedback.
>
> Greg
>
> On Fri, Jun 7, 2019 at 9:24 PM Heer, Christoph wrote:
>
>> Hi everyone,
>>
>> my team and I implemented our own Mesos framework for task execution on
>> our bare-metal on-prem cluster.
>> Especially for task processing workload with known or estimated task
>> duration, the available Mesos maintenance primitives are super powerful for
>> scheduler and operators. While developing the scheduler, I hadn't the
>> feeling it would be complex to support/respect maintenance windows. Already
>> the small logic "Should I launch task X with estimated runtime 3h on node Y
>> with scheduled maintenance in 40min?" saved us tons of aborted tasks. Our
>> hardware operations team also really likes the way to plan and express
>> maintenance windows upfront. Days before the actual maintenance they can
>> add the information and the node will be ready at that point in time. Also,
>> they can reboot the machines without the fear that any production workload
>> will be scheduled until they confirmed the end of the maintenance. But
>> looks like this would be also ensured by the new design.
>>
>> In the past we already used another job orchestration system with a
>> draining approach similar to the design proposal. In nearly all cases the
>> operations team didn't manage to start the draining mode at the right time.
>> Either it was too early, and we didn't use available hardware resources or
>> it was too late and it unnecessarily interrupted productive workload.
>> Especially for long-running tasks which are expensive at restarting, it
>> wasn't a good way to manage scheduled down times.
>>
>> I don't know the implementation within Mesos and therefore I can't judge
>> about the complexity but I think the main problem is that Mesos doesn't
>> provide an intuitive interface for managing maintenance windows. The HTTP
>> API isn't that complicated but you definitely need your own or external tooling.
>> Probably most people are already deterred from the JSON syntax with
>> nanoseconds. Also, the lack of synchronisation of modifications can be a
>> problem and makes it harder to implement tooling around the API. A new more
>> fine-grain HTTP API would be a big improvement and would allow to implement
>> a nice looking interface within the Mesos UI.
>>
>> It would be sad to see this great feature disappearing.
>>
>> Best regards,
>> Christoph
>>
>>
>> Christoph Heer
>> SAP SE, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany
>>
>> Mandatory Disclosure Statement: www.sap.com/impressum
>> This e-mail may contain trade secrets or privileged, undisclosed, or
>> otherwise
>> confidential information. If you have received this e-mail in error

Re: Design doc: Agent draining and deprecation of maintenance primitives

2019-06-06 Thread Greg Mann
Maxime,
Thanks for the feedback, it's much appreciated. I agree that it would be
possible to evolve the existing primitives to accomplish something similar
to the proposal. That is one option that was considered before writing the
design doc, but after some discussion, I thought that it seems more
appropriate to start over with a simpler model that accomplishes what we
perceive to be the predominant use case: the automated draining of agent
nodes, without the concept of a maintenance window or designated
maintenance time in the future. However, perhaps this perception is
incorrect?

Using maintenance metadata to alter the sorting order in the allocator is
an interesting idea; currently, the allocator does not have access to
information about maintenance, but it's conceivable that we could extend
the allocator interface to accommodate this. While the currently-proposed
design would not allow this, it would allow operators to deactivate nodes,
which is an extreme version of this, since deactivated agents would never
have their resources offered to frameworks. This provides a blunt mechanism
to prevent scheduling on nodes which have upcoming maintenance, although it
sounds like you see some benefit to a more subtle notion of scheduling
priority based on upcoming maintenance? Do you think that maintenance-aware
sorting would provide much more benefit to you over agent deactivation? Do
you make use of the existing maintenance primitives to signal upcoming
maintenance on agents?
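For reference, the ordering Maxime describes in the quoted message below (agents with no planned maintenance first, then the rest by maintenance date in decreasing order) can be sketched as follows. This is a hypothetical illustration: the `Agent` type and `maintenance_aware_order` function are made up for this example, and as noted above the allocator does not currently have access to maintenance information.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Agent:
    agent_id: str
    maintenance_start: Optional[datetime] = None  # None: no maintenance planned


def maintenance_aware_order(agents: List[Agent]) -> List[Agent]:
    """Agents without planned maintenance first, then latest maintenance first."""
    unscheduled = [a for a in agents if a.maintenance_start is None]
    scheduled = sorted(
        (a for a in agents if a.maintenance_start is not None),
        key=lambda a: a.maintenance_start,
        reverse=True,
    )
    return unscheduled + scheduled


agents = [
    Agent("agent-a", datetime(2019, 6, 20)),
    Agent("agent-b"),
    Agent("agent-c", datetime(2019, 6, 10)),
]
print([a.agent_id for a in maintenance_aware_order(agents)])
# ['agent-b', 'agent-a', 'agent-c']
```

Agent deactivation, by comparison, would simply drop the scheduled agents from the list instead of placing them last.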

Thanks!
Greg

On Thu, Jun 6, 2019 at 9:37 AM Maxime Brugidou wrote:

> Hi,
>
> As a Mesos operator, I am really surprised by this proposal.
>
> The main advantage of the proposed design is that we can finally set nodes
> down for maintenance with a configurable kill grace period and a proper
> task status (with maintenance primitives, it was TASK_LOST I think) without
> any specific cooperation from the frameworks.
>
> I think that this could be just an evolution of the current primitives.
>
> With the new proposal, it's going to be as difficult as before to have
> SLA-aware maintenances because it will need cooperation from the frameworks
> anyway and we know this is rarely a priority for them. We will also lose
> the ability to signal future maintenance in order to optimize allocations.
>
> For example I had this idea to improve the allocator (or write a custom
> one) that would offer resources from agents with no maintenance planned in
> priority, and then sort agents by maintenance date in decreasing order.
> This would be a big improvement to prevent cluster reboots from triggering too
> many task restarts. This will not be possible with the new primitives. The
> same idea applies to frameworks too.
>
> Maxime
>
> On Thu, May 30, 2019 at 22:16, Joseph Wu wrote:
>
>> As far as I can tell, the document is public.
>>
>> On Thu, May 30, 2019 at 12:22 AM Marc Roos wrote:
>>
>>>
>>> Is the doc not public?
>>>
>>>
>>> -Original Message-
>>> From: Joseph Wu [mailto:jos...@mesosphere.io]
>>> Sent: Thursday, May 30, 2019 2:07
>>> To: dev; user
>>> Subject: Design doc: Agent draining and deprecation of maintenance
>>> primitives
>>>
>>> Hi all,
>>>
>>> A few years back, we added some constructs called maintenance primitives
>>> to Mesos.  This feature was meant to allow operators and frameworks to
>>> cooperate in draining tasks off nodes scheduled for maintenance.  As far
>>> as we've observed since, this feature never achieved enough adoption to
>>> be useful for operators.
>>>
>>> As such, we are proposing a more opinionated approach for draining
>>> tasks.  The goal is to have Mesos perform draining in lieu of
>>> frameworks, minimizing or eliminating the need to change frameworks to
>>> account for draining.  We will also be simplifying the operator
>>> workflow, which would only require a single call (holding an AgentID) to
>>> start draining; and a single call to bring an agent back into the
>>> cluster.
>>>
>>>
>>> Due to how closely this proposed feature overlaps with maintenance
>>> primitives, we will be deprecating maintenance primitives upon
>>> implementation of agent draining.
>>>
>>>
>>> If interested, please take a look at the design document:
>>>
>>>
>>> https://docs.google.com/document/d/1w3O80NFE6m52XNMv7EdXSO-1NebEs8opA8VZPG1tW0Y/
>>>
>>>
>>>


[API WG] Meeting today cancelled

2019-04-30 Thread Greg Mann
Hi all,
There are no agenda items for the API working group meeting today, so I'm
cancelling it.

The next meeting is scheduled for May 14, please add agenda items
if you have any planned or proposed API changes to discuss!

Cheers,
Greg


Planned change: disallow frameworks changing principals

2019-04-01 Thread Greg Mann
Hi all,
Due to MESOS-2842, we
are planning to update the Mesos master to disallow frameworks from
changing their principal during reregistration. This will mean that over
the lifetime of a framework with a given framework ID, the framework will
only be able to use a single principal. Changing principals during
reregistration will currently cause the master to crash.

Furthermore, we persist framework principals in ReservationInfo and
DiskInfo in order to authorize UNRESERVE and DESTROY calls, so allowing the
framework to change its principal would interfere with authorization of
those actions.
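The planned rule, one principal per framework ID for the lifetime of the framework, can be sketched as a check at (re)registration time. This is a hypothetical illustration, not the master's implementation: `FrameworkRegistry`, `register`, and `PrincipalChangeError` are names made up for this example.

```python
from typing import Dict, Optional


class PrincipalChangeError(Exception):
    """Raised when a framework reregisters with a different principal."""


class FrameworkRegistry:
    def __init__(self) -> None:
        # framework ID -> principal recorded at first registration
        self._principals: Dict[str, Optional[str]] = {}

    def register(self, framework_id: str, principal: Optional[str]) -> None:
        """Accept a first registration, or a reregistration with the same principal."""
        if framework_id in self._principals:
            known = self._principals[framework_id]
            if known != principal:
                raise PrincipalChangeError(
                    f"framework {framework_id} registered as {known!r}, "
                    f"cannot reregister as {principal!r}"
                )
        else:
            self._principals[framework_id] = principal


registry = FrameworkRegistry()
registry.register("framework-1", "principal-a")
registry.register("framework-1", "principal-a")  # same principal: accepted

try:
    registry.register("framework-1", "principal-b")
except PrincipalChangeError as error:
    print(error)
```

Rejecting the mismatch up front keeps the principals persisted in ReservationInfo and DiskInfo consistent with the one used to authorize later UNRESERVE and DESTROY calls.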

If this change will cause any issues for you, please let us know. I've also
put this item on the agenda for tomorrow's API working group meeting at
11am PST; feel free to join and discuss!

Cheers,
Greg


Moving the 'mesos-rxjava' repo

2019-03-21 Thread Greg Mann
Hi all,
I wanted to announce a planned change to the mesos-rxjava project, which is
currently hosted in Mesosphere's GitHub organization.

I'm planning to move this project into the unofficial Mesos GitHub
organization so that it can be maintained by the
community. I'll update the README of the existing repo to note that the
repository is deprecated, and will then remove the old repo entirely in a
couple of weeks.

Please let me know if you have any questions or concerns about this plan!
Hopefully this will help the project by enabling community members outside
of Mesosphere to help with its maintenance.

Cheers,
Greg


Re: [API WG] Meeting tomorrow, call for agenda

2019-03-19 Thread Greg Mann
Looks like we have no agenda for the meeting today, so I'm cancelling it.
Next API working group meeting will be held on April 2, see you then!

On Mon, Mar 18, 2019 at 11:33 AM Greg Mann wrote:

> Hi all!
> We have an API working group meeting scheduled for tomorrow, but as of now
> there are no items on the agenda
> <https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit?usp=sharing>.
> If you have any planned or in-progress API changes that have not yet been
> discussed in the community, or any other API-related items for discussion,
> please add them to the agenda doc
> <https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit?usp=sharing>
> !
>
> Thanks,
> Greg
>


[API WG] Meeting tomorrow, call for agenda

2019-03-18 Thread Greg Mann
Hi all!
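We have an API working group meeting scheduled for tomorrow, but as of now
there are no items on the agenda
<https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit?usp=sharing>.
If you have any planned or in-progress API changes that have not yet been
discussed in the community, or any other API-related items for discussion,
please add them to the agenda doc
<https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit?usp=sharing>
!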

Thanks,
Greg


Re: [VOTE] Release Apache Mesos 1.5.3 (rc1)

2019-03-07 Thread Greg Mann
+1 (binding)

Ran through internal CI and observed only known flaky tests; almost all
configurations passed with no failures.

Cheers,
Greg

On Thu, Mar 7, 2019 at 1:55 AM Vinod Kone  wrote:

> +1 (binding)
>
> Ran in ASF CI. Saw some flaky tests but otherwise looks good.
>
> *Revision*: b1dbba03af23b0222d11f2b7ae936d77ef42650d
>
>- refs/tags/1.5.3-rc1
>
> Configuration Matrix
>
> OS / configuration / build tool                  gcc       clang
> centos:7
>   --verbose --disable-libtool-wrappers
>   --disable-parallel-test-execution --enable-libevent --enable-ssl
>     autotools                                    Success   Not run
>     cmake                                        Success   Not run
>   --verbose --disable-libtool-wrappers --disable-parallel-test-execution
>     autotools                                    Success   Not run
>     cmake                                        Success   Not run
> ubuntu:16.04
>   --verbose --disable-libtool-wrappers
>   --disable-parallel-test-execution --enable-libevent --enable-ssl
>     autotools                                    Success   Success
>     cmake                                        Success   Success
>   --verbose --disable-libtool-wrappers --disable-parallel-test-execution
>     autotools                                    Success   Success
>     cmake                                        Success   Failed
> 

[API WG] Meeting tomorrow

2019-03-04 Thread Greg Mann
Hi all,
The next API working group meeting will be held tomorrow, March 5, at 11am
PST.

We'll be discussing the planned update of scheduler API operation
reconciliation to a pattern more similar to task state reconciliation. *Note
that this will be a breaking change for schedulers which are currently
consuming the experimental operation feedback feature*, so if you are the
author of such a scheduler and have thoughts on this plan it would be great
to hear from you!

Please feel free to add other items to the agenda.

Thanks!
Greg


Re: Discussion: Scheduler API for Operation Reconciliation

2019-02-28 Thread Greg Mann
Hey folks,
Sorry to let this thread die out! I wanted to loop back and confirm our
planned approach. We would like to change the v1 scheduler API so that the
RECONCILE_OPERATIONS call no longer receives a synchronous HTTP response,
but instead results in an asynchronous stream of operation status updates
on the scheduler event stream. This mirrors what we currently do for task
reconciliation.

Feel free to chime in on this thread if you have any
questions/comments/concerns. I've added this item to the API working group
agenda for this coming Tuesday, March 5. Feel free to join that meeting to
participate in a discussion!
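
For reference, the state-to-status mapping proposed in the summary quoted
below can be sketched as a lookup. This is only an illustration in Python, not
master code: the state labels ("gone", "not_registered", and so on) and the
function name are made up here, and where the summary leaves a case open
(cases 6 and 8) the sketch returns every candidate it mentions rather than
picking one.

```python
from typing import Optional, Tuple

# Cases (1)-(4): the agent itself is not registered with the master.
AGENT_ONLY = {
    "gone": ("OPERATION_GONE_BY_OPERATOR",),
    "unreachable": ("OPERATION_UNREACHABLE",),
    "not_registered": ("OPERATION_RECOVERING",),
    "unknown": ("OPERATION_UNKNOWN",),
}

# Cases (5)-(8): the agent is registered, so the resource provider (RP)
# state decides. Case (6) lists two candidates; case (8) is a "maybe".
REGISTERED_AGENT = {
    "gone": ("OPERATION_GONE_BY_OPERATOR",),
    "not_registered": ("OPERATION_UNREACHABLE", "OPERATION_RECOVERING"),
    "unknown": ("OPERATION_UNKNOWN",),
    "registered": ("OPERATION_UNKNOWN",),
}


def candidate_statuses(agent: str, provider: Optional[str] = None) -> Tuple[str, ...]:
    """Return the operation statuses the summary suggests for this state."""
    if agent != "registered":
        return AGENT_ONLY[agent]
    if provider is None:
        raise ValueError("provider state is required for a registered agent")
    return REGISTERED_AGENT[provider]
```

With the asynchronous design, whichever status is chosen would arrive as an
operation status update on the scheduler event stream instead of in the HTTP
response to RECONCILE_OPERATIONS.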

Cheers,
Greg

On Thu, Jan 24, 2019 at 7:10 PM Chun-Hung Hsiao 
wrote:

> I chatted with Jie and Gaston, and here is a brief summary:
>
> 1. The ordering issue between the synchronous response and the event stream
> would lead to extra complication for a framework, and thus the benefit
> doesn't seem to be worth the complication.
> 2. However, we should consider not forwarding the reconciliation requests
> to the agents. The status updates don't require a trigger, and if the
> agent could report gone and unregistered RPs to the master, the master can
> respond to the reconciliation request itself.
> The only problem I see is that frameworks may see
> `OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` ->
> `OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.
>
> To address the original problem of MESOS-9318, we could do the following:
> (1) Agent is gone => `OPERATION_GONE_BY_OPERATOR`
> (2) Agent is unreachable => `OPERATION_UNREACHABLE`
> (3) Agent is not registered => `OPERATION_RECOVERING`
> (4) Agent is unknown => `OPERATION_UNKNOWN`
> (5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR`
> (6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or
> `OPERATION_RECOVERING`
> (7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN`
> (8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`?
>
> So it seems a number of people agree with going with the asynchronous
> responses through the event stream. Please reply if you have other
> opinions!
>
> On Thu, Jan 24, 2019 at 1:39 PM James DeFelice 
> wrote:
>
> > I've attempted to implement support for operation status reconciliation in
> > a framework that I've been building. Option (III) seems most convenient
> > from my perspective as well. A single source of updates:
> >
> > (a) Leads to a cleaner framework design; I've had to poke a few holes in
> > the framework's initial design to deal with multiple event sources, leading
> > to increased complexity.
> >
> > (b) Allows frameworks to consume events in the order they arrive (and
> > pushes the responsibility for event ordering back to Mesos). Multiple event
> > sources that the framework needs to (possibly) reorder based on a timestamp
> > would add further complexity that we should avoid pushing onto framework
> > writers.
> >
> > Some other thoughts:
> >
> > (c) I've implemented a background polling loop for exactly the reason that
> > Benno pointed out. An asynchronous API call for operation status
> > reconciliation would be fine with me.
> >
> > (d) API consistency is important. Framework devs are used to the way that
> > the task status reconciliation API works, and have come up with solutions
> > for dealing with the lack of boundaries for streams of explicit
> > reconciliation events. The synchronous response defined for the currently
> > published operation status reconciliation call isn't consistent with the
> > rest of the v1 scheduler API, which generated a bit of extra work (for me)
> > in the low-level mesos v1 http client lib. Consistency should be a primary
> > goal when extending existing API sets.
> >
> > (e) There are probably other ways to solve the problem of a "lack of
> > boundaries within the event stream" for explicit reconciliation requests.
> > If this is a problem that other framework devs need solved then let's
> > address it as a separate issue - and aim to resolve it in a consistent way
> > for both task and operation status event streams.
> >
> > (f) It sounds like option (III) would let Mesos send back smarter
> > operation statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN).
> > Anything to limit the number of scenarios where UNKNOWN is returned to
> > frameworks sounds good to me.
> >
> > -James
> >
> >
> >
> > On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
> > benjamin.bann...@mesosphere.io> wrote:
> >
> >> Hi,
> >>
> >> have we reached a conclusion here?
> >>
> >> From the Mesos side of things I would be strongly in favor of proposal
> >> (III). This is not only consistent with what we do with task status
> >> updates, but also would allow us to provide improved operation status
> >> (e.g., `OPERATION_UNREACHABLE` instead of just 

[RESULT][VOTE] Release Apache Mesos 1.6.2 (rc1)

2019-02-25 Thread Greg Mann
Hi all,

The vote for Mesos 1.6.2 (rc1) has passed with the
following votes.

+1 (Binding)
--
Vinod Kone
Gastón Kleiman
Meng Zhu
Gilbert Song


There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.6.2

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.2

The mesos-1.6.2.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks!
Greg


Re: [VOTE] Release Apache Mesos 1.7.2 (rc1)

2019-02-21 Thread Greg Mann
+1

Built on CentOS 7.4 and ran all tests as root. Only 3 test failures were
observed, all known flakes.

Cheers,
Greg

On Wed, Feb 20, 2019 at 7:12 AM Vinod Kone  wrote:

> +1
>
> Ran this on ASF CI.
>
> The red builds are a flaky infra issue and a known flaky test.
>
> *Revision*: 58cc918e9acc2865bb07047d3d2dff156d1708b2
>
>- refs/tags/1.7.2-rc1
>
> Configuration Matrix
> (results are from
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/66/)
>
> OS / configuration / build tool                  gcc       clang
> centos:7
>   --verbose --disable-libtool-wrappers
>   --disable-parallel-test-execution --enable-libevent --enable-ssl
>     autotools                                    Failed    Not run
>     cmake                                        Success   Not run
>   --verbose --disable-libtool-wrappers --disable-parallel-test-execution
>     autotools                                    Success   Not run
>     cmake                                        Success   Not run
> ubuntu:16.04
>   --verbose --disable-libtool-wrappers
>   --disable-parallel-test-execution --enable-libevent --enable-ssl
>     autotools                                    Success   Success
>     cmake                                        Success   Success
>   --verbose --disable-libtool-wrappers --disable-parallel-test-execution
>     autotools                                    Success   Success
>     cmake                                        Success   Failed
> 

Re: [VOTE] Release Apache Mesos 1.6.2 (rc1)

2019-02-20 Thread Greg Mann
It appears to be a flaky test; that particular failure hasn't come up in
the CI builds that I ran, or in my own manual testing. Just now, I was able
to get that test to fail after many repetitions, but with a different
error. I filed ticket MESOS-9589
<https://issues.apache.org/jira/browse/MESOS-9589> to track.

Cheers,
Greg

On Tue, Feb 19, 2019 at 2:41 PM Vinod Kone  wrote:

> Found a flaky test
> <https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/65/BUILDTOOL=cmake,COMPILER=gcc,CONFIGURATION=--verbose%20--disable-libtool-wrappers%20--disable-parallel-test-execution,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:16.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console>
> in ASF CI. Doesn't seem to be a known issue according to JIRA.
>
> @Greg Mann can you please confirm if this is a flaky
> test or something new?
>
>
>
> On Tue, Feb 19, 2019 at 1:56 PM Greg Mann  wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.6.2.
> >
> >
> > 1.6.2 includes a number of bug fixes since 1.6.1; the CHANGELOG for the
> > release is available at:
> >
> >
> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.2-rc1
> >
> >
> 
> >
> > The candidate for Mesos 1.6.2 release is available at:
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz
> >
> > The tag to be voted on is 1.6.2-rc1:
> > https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.2-rc1
> >
> > The SHA512 checksum of the tarball can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz.sha512
> >
> > The signature of the tarball can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1246
> >
> > Please vote on releasing this package as Apache Mesos 1.6.2!
> >
> > The vote is open until Fri Feb 22 11:54 PST 2019, and passes if a
> majority
> > of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 1.6.2
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> > Greg
> >
>


[VOTE] Release Apache Mesos 1.6.2 (rc1)

2019-02-19 Thread Greg Mann
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.6.2.


1.6.2 includes a number of bug fixes since 1.6.1; the CHANGELOG for the
release is available at:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.2-rc1


The candidate for Mesos 1.6.2 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz

The tag to be voted on is 1.6.2-rc1:
https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.2-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.2-rc1/mesos-1.6.2.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS
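
For anyone who prefers to script the checksum step, here is a minimal Python
sketch. It assumes the tarball and the contents of the .sha512 file have
already been downloaded, and that the hex digest is the first
whitespace-separated token of that file; check the actual file before relying
on that. The function names are just for illustration. Verifying the PGP
signature is a separate step and is not covered here.

```python
import hashlib


def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, checksum_text: str) -> bool:
    """Compare a file's digest with the first token of the checksum text."""
    tokens = checksum_text.split()
    if not tokens:
        return False
    return sha512_of(path) == tokens[0].lower()
```

A call would look like `checksum_matches("mesos-1.6.2.tar.gz", text)`, where
`text` is the content of the downloaded .sha512 file.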

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1246

Please vote on releasing this package as Apache Mesos 1.6.2!

The vote is open until Fri Feb 22 11:54 PST 2019, and passes if a majority
of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.6.2
[ ] -1 Do not release this package because ...
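
As a rough illustration of the passing rule above, the tally can be sketched
as follows. This reads "a majority of at least 3 +1 PMC votes" as: at least
three binding +1 votes, and more binding +1 than -1 votes. That reading and
the function name are assumptions made for the sketch.

```python
from typing import Iterable


def vote_passes(binding_votes: Iterable[int], minimum_plus_ones: int = 3) -> bool:
    """Tally binding (PMC) votes given as +1, 0 or -1."""
    votes = list(binding_votes)
    plus = sum(1 for v in votes if v > 0)
    minus = sum(1 for v in votes if v < 0)
    # Enough +1 votes were cast, and they outnumber the -1 votes.
    return plus >= minimum_plus_ones and plus > minus
```

Non-binding votes are simply left out of `binding_votes`.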

Thanks,
Greg


[API Working Group] Meeting today cancelled

2019-02-19 Thread Greg Mann
Hi all,
We don't have any agenda items for the API working group meeting today, so
I'm cancelling it. The next meeting will be held on March 5 at 11am PST;
feel free to add items to the agenda!

Cheers,
Greg


[DISCUSSION] Making RESOURCE_PROVIDER capability required

2019-02-11 Thread Greg Mann
Hi all,
I'm working on an issue related to operation feedback on agent default
resources, MESOS-9535.
This involves the master's handling of an agent capability that we recently
added, AGENT_OPERATION_FEEDBACK. This new capability is optional (i.e. not
in the agent's list of capabilities required for agent startup),
and it has the RESOURCE_PROVIDER capability as a prerequisite.

I need to update the master code to avoid memory leaks in the case where an
agent is downgraded from AGENT_OPERATION_FEEDBACK-capable to
non-AGENT_OPERATION_FEEDBACK-capable. In this case, it is difficult for the
master to tell the difference between a true *version downgrade* to an
older agent, and a downgrade to a *recent agent* which has simply had the
capability unset by an operator.

To avoid this difficulty, I'm considering the possibility of making both
the RESOURCE_PROVIDER and AGENT_OPERATION_FEEDBACK capabilities required
for agent startup starting in 1.8.0. This would mean that operators could
no longer opt out of all of the new operation-handling code paths in the
master (`ApplyOperationMessage`, `UpdateOperationStatusMessage`, etc.).
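
To illustrate what "required for agent startup" would mean here, below is a
small Python sketch of a startup-time capability check. It is not agent code:
the function name and data layout are invented for the example, and it models
only the two capabilities discussed in this message, including the
prerequisite relationship between them.

```python
from typing import Dict, FrozenSet, Iterable, List

# The two capabilities this proposal would make mandatory in 1.8.0.
REQUIRED: FrozenSet[str] = frozenset({"RESOURCE_PROVIDER", "AGENT_OPERATION_FEEDBACK"})

# AGENT_OPERATION_FEEDBACK has RESOURCE_PROVIDER as a prerequisite.
PREREQUISITES: Dict[str, FrozenSet[str]] = {
    "AGENT_OPERATION_FEEDBACK": frozenset({"RESOURCE_PROVIDER"}),
}


def startup_errors(
    capabilities: Iterable[str], required: FrozenSet[str] = REQUIRED
) -> List[str]:
    """Return human-readable problems; an empty list means startup may proceed."""
    enabled = set(capabilities)
    errors = [f"missing required capability {name}" for name in sorted(required - enabled)]
    for name in sorted(enabled):
        for needed in sorted(PREREQUISITES.get(name, frozenset()) - enabled):
            errors.append(f"{name} requires {needed}")
    return errors
```

If both capabilities are always present, the master no longer has to guess
whether a missing capability means an older agent version or an operator
opt-out.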

I wanted to reach out to the community to see how folks feel about this
change, and also if there are any cluster operators out there who have been
disabling the RESOURCE_PROVIDER capability on their agents.

Thanks in advance for your input!

Cheers,
Greg


Re: [DISCUSS] Updating the support and release policy

2019-01-28 Thread Greg Mann
I'm fine with keeping old branches around and stating in our docs/READMEs
that the branches are unsupported. After looking at a few other open source
projects, it seems that this practice is not uncommon.

On Mon, Jan 14, 2019 at 11:42 AM Vinod Kone  wrote:

> Hi folks,
>
> As discussed in the Community WG meeting today, I wanted to send out a
> proposal for updating the current support and release policy.
>
> Context: According to our release policy, the latest released version and
> last 2 released versions are supported at any given time. With an expected
> timeline of a minor release every 3 months, that means a minor release is
> typically supported for 9 months. So far, we've indicated that a release is
> unsupported by deleting the corresponding release branch in our repository.
>
> The new proposal is as follows:
>
>   - Keep the unsupported release branches and not delete them. Instead, we
>     would make it clear in the CHANGELOG and also on the downloads page in
>     our website which releases are supported and which are not.
>   - If a committer would like to backport a fix to an unsupported release
>     branch, they can do so. Such a backport is not required but a committer
>     can do it if they wish. Contributor and committer should have a dialog
>     regarding this.
>   - CI will keep running against both supported and unsupported release
>     branches (as it is today) and any issues that might arise will be fixed
>     on a best effort basis.
>   - A committer can ask a contributor to submit a backport review in case
>     the backport is complicated. Our review tooling (post-reviews and
>     reviewbot) will be updated to make this possible.
>
> Based on our experience with the current policy in the last couple of years
> and the reality of how some of the organizations are using Mesos, we
> believe these tweaks will make it more practical and useful.
>
> Please let us know your thoughts by replying here or chatting in #community
> in our slack channel.
>
> Thanks,
> Vinod (on behalf of Community WG)
>


[API WG] Meeting today cancelled

2019-01-08 Thread Greg Mann
Hi all!
We don't have an agenda for the API working group meeting today, so I'm
cancelling it. Please add any proposed API changes or other items for
discussion to the agenda, and we can discuss at the next meeting on Jan. 22.

Happy new year everyone!

Greg


[API WG] Meeting tomorrow

2018-12-10 Thread Greg Mann
Hi all,
The API working group will meet tomorrow, Dec. 11 at 11am PST. On the
agenda we have:

   - Proposed calls for the scheduler API:
      - UNSUPPRESS
      - CLEAR_FILTER
      - REQUEST_RESOURCE
      - Adding a new 'ResourceQuantity' type
   - Improving the scheduler operation reconciliation API


We will meet at this Zoom link: https://zoom.us/j/567559753
You can check out the agenda doc!

Cheers,
Greg


[API WG] Meeting today, 11am PST

2018-11-27 Thread Greg Mann
Hi all,
The API working group is meeting this morning at 11am PST. We'll be
discussing some potential changes to the agent's executor API to address
agent/executor communication issues, as well as an update to the master's
operator API to mitigate problems we've seen with Amazon's ELB. Feel free
to add other items of discussion to the agenda!

Cheers,
Greg


Re: [API WG] Proposals for dealing with master subscriber leaks.

2018-11-21 Thread Greg Mann
Thanks for the proposal Joseph! I think I'm also leaning toward the
circular buffer approach. The one real concern there seems to be the
potential for "DDoS"-type scenarios when users hit the subscriber limit
using clients which have retry logic. Providing a metric for the number of
currently connected subscribers will hopefully help operators avoid this.

The default value for a new flag limiting subscriber count should be very
high (MAX_INT??) to maintain current behavior.
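
For readers who haven't looked at the review yet, the circular buffer idea can
be sketched roughly as below. This is a Python illustration of the concept
only, not Joseph's patch: the class name, the `disconnect` callback, and the
assumption that subscriber IDs are unique per connection are all invented for
the example.

```python
from collections import OrderedDict
from typing import Callable


class SubscriberBuffer:
    """Holds at most `max_subscribers`; adding one more evicts the oldest."""

    def __init__(self, max_subscribers: int) -> None:
        if max_subscribers < 1:
            raise ValueError("max_subscribers must be at least 1")
        self._max = max_subscribers
        # Insertion order doubles as connection age (oldest first).
        self._subscribers: "OrderedDict[str, Callable[[], None]]" = OrderedDict()

    def add(self, subscriber_id: str, disconnect: Callable[[], None]) -> None:
        """Register a new subscriber, disconnecting the oldest if full."""
        while len(self._subscribers) >= self._max:
            _, close_oldest = self._subscribers.popitem(last=False)
            close_oldest()
        self._subscribers[subscriber_id] = disconnect

    def remove(self, subscriber_id: str) -> None:
        """Forget a subscriber that closed its own connection."""
        self._subscribers.pop(subscriber_id, None)

    def __len__(self) -> int:
        return len(self._subscribers)
```

`len(buffer)` is the kind of value the connected-subscribers metric mentioned
above would report, so operators can see when they are approaching the limit.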

What do other folks think about this approach? Joseph's draft review is
here: https://reviews.apache.org/r/69307/

Greg

On Wed, Nov 14, 2018 at 6:35 PM Joseph Wu  wrote:

> Heartbeats are currently the least-liked solution, for precisely the
> reason BenM stated.  Clients of the API, such as the maintainers of the
> DC/OS UI, would also like to avoid making more connections than necessary
> and/or keeping additional state between connections.
>
>
> Currently, I am leaning towards keeping subscribers in a circular buffer.
> This solution is minimal in the code footprint and requires no client-side
> changes besides heavily incentivizing retry logic (which we already expect
> in most cases).
> One potential downside is having more subscribers than the (master flag)
> configured maximum.  In this case, each client would kick out the first
> few; which would then retry and kick out the next few, etc.  Each retry is
> equivalent to a GET /master/state, and the extra calls would basically
> erase the performance gains we have from streaming the events.
>
> Nevertheless, I think a reasonably high default would have minimal impact
> on both master performance and client connectivity.  The code for this
> proposal can be found here:
> https://reviews.apache.org/r/69307/  (Just one review)
>
> On Sun, Nov 11, 2018 at 9:22 AM Benjamin Mahler 
> wrote:
>
>> >- We can add heartbeats to the SUBSCRIBE call.
>> > This would need to be
>> >  part of a separate operator Call, because one platform (browsers) that
>> > might subscribe to the master does not support two-way streaming.
>>
>> This doesn't make sense to me, the heartbeats should still be part of the
>> same connection (request and response are infinite and heartbeating) by
>> default. Splitting into a separate call is messy and shouldn't be what we
>> force everyone to do, it should only be done in cases that it's impossible
>> to use a single connection (e.g. browsers).
>>
>> On Sat, Nov 10, 2018 at 12:03 AM Joseph Wu  wrote:
>>
>>> Hi all,
>>>
>>> During some internal scale testing, we noticed that, when Mesos streaming
>>> endpoints are accessed via certain proxies (or load balancers), the
>>> proxies
>>> might not close connections after they are complete.  For the Mesos
>>> master,
>>> which only has the /api/v1 SUBSCRIBE streaming endpoint, this can
>>> generate
>>> unnecessary authorization requests and affects performance.
>>>
>>> We are considering a few potential solutions:
>>>
>>>- We can add heartbeats to the SUBSCRIBE call.  This would need to be
>>>part of a separate operator Call, because one platform (browsers) that
>>>might subscribe to the master does not support two-way streaming.
>>>- We can add (optional) arguments to the SUBSCRIBE call, which tells
>>> the
>>>master to disconnect it after a while.  And the client would have to
>>> remake
>>>the connection every so often.
>>>- We can change the master to hold subscribers in a circular buffer,
>>> and
>>>disconnect the oldest ones if there are too many connections.
>>>
>>> We're tracking progress on this issue here:
>>> https://issues.apache.org/jira/browse/MESOS-9258
>>> Some prototypes of the code changes involved are also linked in the JIRA.
>>>
>>> Please chime in if you have any suggestions or if any of these options
>>> would be undesirable/bad,
>>> ~Joseph
>>>
>>


[API WG] Meeting cancelled - Oct. 30

2018-10-29 Thread Greg Mann
Hi all,
We currently have no agenda for the meeting tomorrow, and I'll be unable to
attend. For these reasons, I'd like to cancel this one. Our next meeting is
scheduled for Nov. 13 - see you then!

Cheers,
Greg


Re: [VOTE] Release Apache Mesos 1.5.2 (rc1)

2018-10-24 Thread Greg Mann
Hmm I wonder if this is an issue on 1.5.1, or perhaps introduced by this
commit? https://github.com/apache/mesos/commit/902aa34b79

On Wed, Oct 24, 2018 at 12:30 PM Vinod Kone  wrote:

> -1
>
> Tested on ASF CI. Looks like Clang builds are failing with a build error.
> See example build output
> <
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/55/BUILDTOOL=autotools,COMPILER=clang,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
> >
> below:
>
> libtool: compile:  clang++-3.5 -DPACKAGE_NAME=\"mesos\"
> -DPACKAGE_TARNAME=\"mesos\" -DPACKAGE_VERSION=\"1.5.2\"
> "-DPACKAGE_STRING=\"mesos 1.5.2\"" -DPACKAGE_BUGREPORT=\"\"
> -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" -DVERSION=\"1.5.2\"
> -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1
> -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1
> -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1
> -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\"
> -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1
> -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_LIBCURL=1
> -DMESOS_HAS_JAVA=1 -DHAVE_EVENT2_EVENT_H=1 -DHAVE_LIBEVENT=1
> -DHAVE_EVENT2_THREAD_H=1 -DHAVE_LIBEVENT_PTHREADS=1 -DHAVE_LIBSASL2=1
> -DHAVE_OPENSSL_SSL_H=1 -DHAVE_EVENT2_BUFFEREVENT_SSL_H=1
> -DHAVE_LIBEVENT_OPENSSL=1 -DUSE_SSL_SOCKET=1 -DHAVE_SVN_VERSION_H=1
> -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1
> -DHAVE_ZLIB_H=1 -DHAVE_LIBZ=1 -DHAVE_PYTHON=\"2.7\"
> -DMESOS_HAS_PYTHON=1 -I. -I../../src -Werror
> -DLIBDIR=\"/mesos/mesos-1.5.2/_inst/lib\"
> -DPKGLIBEXECDIR=\"/mesos/mesos-1.5.2/_inst/libexec/mesos\"
> -DPKGDATADIR=\"/mesos/mesos-1.5.2/_inst/share/mesos\"
> -DPKGMODULEDIR=\"/mesos/mesos-1.5.2/_inst/lib/mesos/modules\"
> -I../../include -I../include -I../include/mesos -DPICOJSON_USE_INT64
> -D__STDC_FORMAT_MACROS -isystem ../3rdparty/boost-1.53.0 -isystem
> ../3rdparty/concurrentqueue-7b69a8f -I../3rdparty/elfio-3.2
> -I../3rdparty/glog-0.3.3/src -I../3rdparty/leveldb-1.19/include
> -I../../3rdparty/libprocess/include -I../3rdparty/nvml-352.79
> -I../3rdparty/picojson-1.3.0 -I../3rdparty/protobuf-3.5.0/src
> -I../../3rdparty/stout/include
> -I../3rdparty/zookeeper-3.4.8/src/c/include
> -I../3rdparty/zookeeper-3.4.8/src/c/generated -isystem
> /usr/include/subversion-1 -isystem /usr/include/apr-1 -isystem
> /usr/include/apr-1.0 -pthread -Wall -Wsign-compare -Wformat-security
> -fstack-protector-strong -fPIC -g1 -O0 -std=c++11 -MT
> slave/containerizer/libmesos_no_3rdparty_la-containerizer.lo -MD -MP
> -MF slave/containerizer/.deps/libmesos_no_3rdparty_la-containerizer.Tpo
> -c ../../src/slave/containerizer/containerizer.cpp  -fPIC -DPIC -o
> slave/containerizer/.libs/libmesos_no_3rdparty_la-containerizer.o
> In file included from ../../src/slave/http.cpp:30:
> In file included from ../../include/mesos/authorizer/authorizer.hpp:25:
> ../../3rdparty/libprocess/include/process/future.hpp:1089:3: error: no
> matching member function for call to 'set'
>   set(u);
>   ^~~
> ../../src/slave/http.cpp:3196:10: note: in instantiation of function
> template specialization
> 'process::Future::Future process::Future > >' requested here
>   return slave->containerizer->attach(containerId)
>  ^
> ../../3rdparty/libprocess/include/process/future.hpp:597:8: note:
> candidate function not viable: no known conversion from 'const
> process::Future >' to
> 'const process::http::Response' for 1st argument
>   bool set(const T& _t);
>^
> ../../3rdparty/libprocess/include/process/future.hpp:598:8: note:
> candidate function not viable: no known conversion from 'const
> process::Future >' to
> 'process::http::Response' for 1st argument
>   bool set(T&& _t);
>^
>
>
>
>
>
>
>
> On Mon, Oct 22, 2018 at 12:53 AM Gilbert Song  wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.5.2.
> >
> > 1.5.2 includes the following:
> >
> >
> 
> >   * [MESOS-3790] - ZooKeeper connection should retry on `EAI_NONAME`.
> >   * [MESOS-8128] - Make os::pipe file descriptors O_CLOEXEC.
> >   * [MESOS-8418] - mesos-agent high cpu usage because of numerous
> > /proc/mounts reads.
> >   * [MESOS-8545] -
> > AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> >   * [MESOS-8568] - Command checks should always call
> > `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`.
> >   * [MESOS-8620] - Containers stuck in FETCHING possibly due to
> > unresponsive server.
> >   * [MESOS-8830] - Agent gc on old slave sandboxes could empty persistent
> > volume data.
> >   * [MESOS-8871] - Agent may fail to recover if the agent dies before
> > image store cache checkpointed.
> >   * [MESOS-8904] - Master crash when removing quota.
> >   * [MESOS-8906] - `UriDiskProfileAdaptor` fails 

Proposal: Adding health check definitions to master state output

2018-10-18 Thread Greg Mann
Hi all,
In addition to the health check API change proposal that I recently sent
out, we're considering adding a task's health check definition (when
present) to the 'Task' protobuf message so that it appears in the master's
'/state' endpoint response, as well as the v1 GET_STATE response and the
TASK_ADDED event. This will allow operators to detect the presence and
configuration of health checks on tasks via the operator API, which they
are currently unable to do:

message Task {
  . . .

  optional HealthCheck health_check = 15;

  . . .
}

I wanted to check in with the community regarding this change, since for
very large clusters it could have a non-negligible impact on the size of
the master's state output.
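
To make the operator use case concrete, here is a minimal Python sketch of how a tool might pick out tasks that define a health check once the field is exposed. The JSON layout (a top-level 'frameworks' list whose entries carry a 'tasks' list), the 'health_check' key, and the sample values are assumptions for illustration only; this is hand-written data, not real master output.

```python
import json

# Hypothetical, hand-written excerpt of a master '/state' response.
SAMPLE_STATE = json.loads("""
{
  "frameworks": [
    {
      "name": "example-framework",
      "tasks": [
        {"id": "web-1",
         "health_check": {"type": "HTTP", "grace_period_seconds": 10.0}},
        {"id": "batch-1"}
      ]
    }
  ]
}
""")


def tasks_with_health_checks(state):
    """Return {task_id: health_check_definition} for tasks that define one."""
    found = {}
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            # The key is simply absent when no health check is defined.
            if "health_check" in task:
                found[task["id"]] = task["health_check"]
    return found
```

Since every task with a health check would carry its full definition, the extra bytes scale with the number of such tasks, which is the size concern raised above.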

It's worth mentioning that I believe the original intention of the 'Task'
message was to contain most information contained in 'TaskInfo', except for
those fields which could grow very large, like the 'data' field.

Please reply if you foresee this change having a negative impact on your
deployments, or if you have any other thoughts/concerns!

Thanks,
Greg


Request for Comments - Health Check API Proposal

2018-10-17 Thread Greg Mann
Hi all,
Some users have recently reported issues with our current implementation of
health checks. See this ticket for an introduction to the issue.

To summarize: we currently use a single 'optional bool healthy' field
within the 'TaskStatus' message to indicate the result of a health check.
This allows us to expose 3 health states to users:
1) 'healthy' field is unset = no health check specified, or health check
failed but grace period has not yet elapsed, or health check has not yet
been attempted
2) 'healthy' field is set to 'false' = a health check is specified and it
returned 'false'
3) 'healthy' field is set to 'true' = a health check is specified and it
returned 'true'

The issue is that some users need to distinguish between the scenarios
conflated in #1, e.g.: no health check is specified, OR the task is not yet
healthy but we are in the grace period. An example use case would be a load
balancer which needs to wait for a healthy status to route traffic, but
which immediately routes traffic to tasks which have no health check
defined.
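
The ambiguity can be sketched in a few lines of Python. This is only an illustration: `None` stands in for the unset protobuf field, and both function names are hypothetical.

```python
from typing import Optional


def describe_healthy_field(healthy: Optional[bool]) -> str:
    """Interpret the existing 'optional bool healthy' field of a TaskStatus."""
    if healthy is None:
        # Unset: three different situations are indistinguishable here.
        return "no health check, OR within grace period, OR not yet attempted"
    return "healthy" if healthy else "unhealthy"


def route_with_legacy_field(healthy: Optional[bool]) -> bool:
    """Hypothetical load balancer rule based only on the existing field."""
    # Treating 'unset' as routable is right for tasks without a health check,
    # but wrong for tasks that are still starting up within the grace period.
    return healthy is None or healthy
```

Whichever way the load balancer treats the unset case, it mishandles one of the scenarios that share it.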

This issue was recognized during the design of Mesos generalized checks;
for those checks, we use the presence of the 'check_status' field to
indicate whether or not a check is defined for the task. While consumers
could make use of generalized checks as a workaround, this does not allow
them to both detect the presence of a check AND achieve the task-killing
behavior that health checks provide.

In order to address this, I would like to propose the following new
message, and an addition to the 'TaskStatus' message:

message HealthCheckStatusInfo {
  enum Status {
    UNKNOWN = 0;
    HEALTHY = 1;
    UNHEALTHY = 2;
  }

  required Status status = 1;
}

message TaskStatus {
  . . .

  optional HealthCheckStatusInfo health_check_status = 17;

  . . .
}

The semantics of these fields would be as follows:

'health_check_status' field:
- If set, a health check has been set
- If unset, a health check has not been set

'health_check_status.status' field:
- UNKNOWN: The task has not become healthy but is still within its grace
period (this state is also used if an internal error prevents us from
running the health check successfully)
- HEALTHY: The health check indicates the task is healthy
- UNHEALTHY: The health check indicates the task is not healthy
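
As a rough sketch of how a consumer such as the load balancer mentioned earlier could use these semantics, the Python below models the proposed messages with plain dataclasses. The classes are stand-ins rather than generated protobuf bindings, `None` models an unset field, and `route_with_proposed_field` is a hypothetical name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UNKNOWN = 0
    HEALTHY = 1
    UNHEALTHY = 2


@dataclass
class HealthCheckStatusInfo:
    status: Status


@dataclass
class TaskStatus:
    # None models the 'health_check_status' field being unset.
    health_check_status: Optional[HealthCheckStatusInfo] = None


def route_with_proposed_field(task_status: TaskStatus) -> bool:
    """Route immediately if no check is defined, else only when HEALTHY."""
    info = task_status.health_check_status
    if info is None:
        # No health check defined for this task: nothing to wait for.
        return True
    # A check is defined: UNKNOWN (grace period) and UNHEALTHY both wait.
    return info.status is Status.HEALTHY
```

Presence of the field now answers "is a check defined?", while the enum answers "what did it report?", so the two questions no longer share one value.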

This change would also involve deprecating the existing 'healthy' field. In
accordance with our deprecation policy, I believe we could not remove the
deprecated field until we have a new major release (2.x).

I'd love to hear feedback on this proposal, thanks in advance! I'll also
add this as an agenda item to our upcoming API working group meeting on
Tuesday, Oct. 16 at 11am PST.

Cheers,
Greg


[API WG] Next meeting today, 11am PST

2018-10-16 Thread Greg Mann
Hi all,
Please join us for the Mesos API working group meeting today at 11am PST.
We'll be discussing proposed updates to the 'CREATE_DISK' API, as well as
possible updates to the way Mesos exposes health checks to frameworks in
task status updates.

Feel free to add more items to the agenda here!
The meeting will be held at the following Zoom link:
https://zoom.us/j/567559753

Cheers,
Greg


This Month in Mesos: September 2018

2018-10-05 Thread Greg Mann
Hi all,
It's time again for your monthly dose of news from Apache Mesosland! Here's
a recap of the happenings from September:

Mesos 1.7.0
Of course, the big news this past month is the new version! Apache Mesos
1.7.0 has been released; huge thanks to release managers Gastón Kleiman and
Chun-Hung Hsiao, as well as all the contributors in the community, for all
their hard work! For more information, see the release blog post.


MesosCon 2018
MesosCon 2018 will be held in San Francisco from Nov. 5-7! We have an
exciting lineup of talks this year, and the full schedule will be posted
soon at https://mesoscon18.sched.com/ - buy your tickets now!


Containerization
A lot of containerization-related work has landed recently, including:

   - A new 'linux/devices' isolator, which automatically populates
   containers with devices that have been whitelisted in the
   '--allowed_devices' agent flag.
   - Better container network statistics.
   - Better container image pulling metrics.
   - Many bug fixes!

Find more info in the agenda/notes document.


Performance
Performance improvements have landed in a variety of components within the
codebase including metrics, containerization, and resource allocation:

   - Additional work on the parallelization of master API requests.
   - More allocator optimizations, for improved resource allocation
   performance!
   - A new benchmark test fixture for the allocator.

More information is in the agenda/notes document.


That's it for September! Look out for next month's update, and until then,
see you on Mesos community Slack!

Cheers,
Greg


[API WG] Meeting today cancelled

2018-10-02 Thread Greg Mann
Hi all,
We don't have any agenda items for the API working group meeting today, so
I'm going to cancel it.

As a reminder, if you're designing or working on any user-facing API
changes, please bring them to the working group for discussion! You can add
items to the agenda for the next meeting here.

Cheers,
Greg


[API WG] Meeting in 30 minutes!

2018-09-18 Thread Greg Mann
Hi all,
We'll be holding the API working group meeting in 30 minutes, at 11am PST.
You can find the agenda document, which includes a link for the Zoom
meeting, here;
feel free to add new agenda items! Currently, the only item on the agenda
is some grooming of API-related JIRAs.

Cheers,
Greg


[API WG] Meeting today

2018-09-04 Thread Greg Mann
Hi all,
We're having an API working group meeting this morning at 11am PST. I'll be
facilitating a discussion about the future of metrics in Mesos. If you have
any other topics for discussion, feel free to add them to the agenda:
https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit

Cheers,
Greg


Re: [VOTE] Release Apache Mesos 1.7.0 (rc2)

2018-08-28 Thread Greg Mann
+1 (binding)

Tested in our internal CI; several test failures were observed but they are
known flaky tests.

Also ran our internal DC/OS integration tests against this build, and the
results look good.


Thanks for the reminder, Chun-Hung!

Greg

On Tue, Aug 28, 2018 at 9:14 AM, Chun-Hung Hsiao 
wrote:

> Folks,
>
> This is a gentle reminder for 1.7.0-rc2.
> The vote is open until Wed Aug 29 23:59:59 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> Thanks!
>
> On Fri, Aug 24, 2018, 4:45 PM Chun-Hung Hsiao 
> wrote:
>
>> Hi all,
>>
>> Since there will be a weekend during the vote period,
>> the vote will be open until Wed Aug 29 23:59:59 PDT 2018,
>> so we can have more time testing.
>>
>> Best,
>> Chun-Hung
>>
>> On Fri, Aug 24, 2018 at 4:42 PM Chun-Hung Hsiao 
>> wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.7.0.
>>>
>>>
>>> 1.7.0 includes the following:
>>> 
>>> 
>>> * Performance Improvements:
>>>   * Master `/state` endpoint: ~130% throughput improvement through
>>> RapidJSON
>>>   * Allocator: Improved allocator cycle significantly
>>>   * Agent `/containers` endpoint: Fixed a performance issue
>>>   * Agent container launch / destroy throughput is significantly improved
>>> * Containerization:
>>>   * **Experimental** Supported docker image tarball fetching from HDFS
>>>   * Added new `cgroups/all` and `linux/devices` isolators
>>>   * Added metrics for `network/cni` isolator and docker pull latency
>>> * Windows:
>>>   * Added support to libprocess for the Windows Thread Pool API
>>> * Multi-Framework Workloads:
>>>   * **Experimental** Added per-framework metrics to the master
>>>   * A new weighted random sorter was added as an alternative to the DRF
>>> sorter
>>>
>>> The CHANGELOG for the release is available at:
>>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.7.0-rc2
>>> 
>>> 
>>>
>>> The candidate for Mesos 1.7.0 release is available at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz
>>>
>>> The tag to be voted on is 1.7.0-rc2:
>>> https://gitbox.apache.org/repos/asf?p=mesos.git;a=commit;h=1.7.0-rc2
>>>
>>> The SHA512 checksum of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.sha512
>>>
>>> The signature of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.7.0-rc2/mesos-1.7.0.tar.gz.asc
>>>
>>> The PGP key used to sign the release is here:
>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>
>>> The JAR is in a staging repository here:
>>> https://repository.apache.org/content/repositories/orgapachemesos-1233
>>>
>>> Please vote on releasing this package as Apache Mesos 1.7.0!
>>>
>>> The vote is open until Mon Aug 27 16:37:35 PDT 2018 and passes if a
>>> majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Mesos 1.7.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> Thanks,
>>> Chun-Hung & Gaston
>>>
>>


[Community WG] Next meeting Monday 8/27, 10:30am PST

2018-08-24 Thread Greg Mann
Hi all,
The next Mesos community working group meeting will be held this coming
Monday, 8/27 at 10:30am PST. If you have any items for the agenda, please
add them here!

Thus far on the agenda, we have:

   - Discussion on stale Github PRs
   - MesosCon
   - Discussion on the possibility of hosting a "testathon"

If you're available, it would be great to see you there!

Cheers,
Greg


Re: Follow up to discussion regarding use : in paths on Windows (MESOS-9109)

2018-08-23 Thread Greg Mann
Thanks Andy! Responses inlined below.



> No: As the only character we've run into a problem with is `:`
> (MESOS-9109), it might not be worth it to generalize this to solve a bunch
> of problems that we haven't encountered.
>
>
It's true that I'm not aware of other scenarios where filesystem-disallowed
characters in task/executor IDs have caused issues for users, and this
issue has existed for a long time. However, when feasible I would like to
fix issues that we're aware of before they cause problems for users, rather
than after. I would suggest that since we have one compelling case that we
need to address now, it's worth formulating an approach for the general
case, so that we can be sure any current work doesn't get in our way later
on.


> I'm somewhat comfortable doing so only for Windows, as we don't really
> need to worry about the recovery scenario; but very uncomfortable about
> doing so for Linux etc., for precisely that reason.
>
> So expanding this is definitely up for debate; but we must fix the bug
> with `:`.
>
>
Indeed, addressing the general case may prove to be much more complex - I
can certainly identify with this situation, where a fix for a smaller issue
turns into a big project :)
It may turn out to be possible to implement a scoped-down solution for the
colon case now, and extend it later on. I think it would be good if we
could at least get an idea of how we want to handle the general case now,
so that any short-term solutions can be a constructive step toward the
long-term.

Cheers,
G


Re: Follow up to discussion regarding use : in paths on Windows (MESOS-9109)

2018-08-22 Thread Greg Mann
Thanks for addressing this Andy!! AFAIK we allow all characters in executor
and task IDs; I'm surprised we haven't run into issues like this before on
Linux.

The percent-encoding approach seems fine to me, as long as the percent
character isn't an issue on any filesystems that we're interested in. As a
starting point, Wikipedia seems to have a decent survey of restrictions on
different filesystems here.
Looks like the percent character may be fine.

I wonder if there are other characters we should be concerned about? I'm
guessing we should worry about slashes and backslashes as well? Seems like
a more general solution might help us avoid similar pitfalls in the future.
Perhaps we could just percent-encode executor and task IDs before we write
to disk? If we did this, we would have issues during recovery to consider,
where we need to look for "old" paths when recovering state from an "old"
agent.
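
To make the idea concrete, here is a minimal Python sketch of percent-encoding an ID before using it as a path component, plus one possible way to handle the recovery concern by looking for both the new and the old on-disk name. This is illustrative only and is not the approach taken in the patches under review; all function names are hypothetical.

```python
from typing import List
from urllib.parse import quote, unquote


def encode_id(raw_id: str) -> str:
    """Percent-encode an executor/task ID for use as a single path component."""
    # safe="" makes quote() encode '/' as well; ':' and '\\' are encoded by
    # default, and a literal '%' becomes '%25' so decoding stays unambiguous.
    return quote(raw_id, safe="")


def decode_id(component: str) -> str:
    """Recover the original ID from an encoded path component."""
    return unquote(component)


def candidate_components(raw_id: str) -> List[str]:
    """Directory names to look for during recovery: encoded first, then raw."""
    encoded = encode_id(raw_id)
    # An "old" agent would have written the raw ID, so check that as well
    # whenever encoding actually changed the name.
    return [encoded] if encoded == raw_id else [encoded, raw_id]
```

For example, `encode_id("a:b")` gives `a%3Ab`, and `candidate_components("a:b")` returns the encoded name followed by the raw one. IDs made up only of letters, digits, and `_.-~` pass through unchanged, so most existing paths would not move.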

In any case, I'm wondering if this warrants a general solution that could
take care of all filesystem-disallowed characters. WDYT?

Cheers,
Greg


On Tue, Aug 21, 2018 at 2:02 PM, Andrew Schwartzmeyer <
and...@schwartzmeyer.com> wrote:

> Hey all,
>
> I have a set of patches up for MESOS-9109 that I need reviewed, starting
> here: https://reviews.apache.org/r/68297/.
>
> Eduard here was trying to use Chronos to schedule a task on a Windows
> agent, and found an error due to the fact that Chronos uses colons (as in
> `:`) in its generated framework (and task) IDs. Now, to maintain backward
> compatibility, we obviously can't disallow the use of `:` as there are
> frameworks already using it. However, this is a reserved character on
> Windows for file system paths
> (https://docs.microsoft.com/en-us/windows/desktop/FileIO/naming-a-file),
> so it cannot be in the path.
>
> My first implementation simply applied `s/:/_COLON_` to `frameworkId` and
> `taskId` in the functions in `paths.cpp` which generate Mesos's filesystem
> paths. While this worked, it's kind of a kludge. Or that is to say, it
> would be nicer to use the percent-encoded form `%3A` instead. Doing so,
> however, revealed a bug in libprocess (MESOS-9168) that I have also fixed
> and need reviewed, starting here: https://reviews.apache.org/r/68420/
>
> So combining the two fixes, the chain maps `:` in `frameworkId` and
> `taskId` to `%3A` (and back when appropriate). This obviously doesn't fix
> any third-party tooling, but being Windows, I don't think there is any yet
> to worry about.
>
> I wanted to get this in for 1.7, but due to a miscommunication, we were
> not able to land it in time. If you can, please review! Or if you have a
> better way of doing this, let me know!
>
> Thanks,
>
> Andy
>
> P.S. Original discussion here:
> https://mesos.slack.com/archives/C1LPTK50T/p153332465396 (our Slack
> archives seem to be down, so this is only available until Slack cycles
> out sadly).
>


Re: API Working Group Tomorrow

2018-08-21 Thread Greg Mann
Hi all,
We don't have any agenda items for the API working group today, so let's
cancel this one.

I'll be leading a discussion next time, Tues 9/4, on the future of metrics
in Mesos. Be sure to tune in for that one, it would be great to get
community feedback on what people would like to see in the next evolution
of Mesos metrics!

Cheers,
Greg


On Mon, Aug 20, 2018 at 11:40 AM, Greg Mann  wrote:

> Hi all,
> The next scheduled API working group meeting is tomorrow at 11am PST.
> There are currently no items on the agenda - please feel free to add them
> here
> <https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit#>!
> If you're currently working on or planning any changes to user-facing APIs,
> this is a great opportunity to get community feedback.
>
> Cheers,
> Greg
>


API Working Group Tomorrow

2018-08-20 Thread Greg Mann
Hi all,
The next scheduled API working group meeting is tomorrow at 11am PST. There
are currently no items on the agenda - please feel free to add them here!
If you're currently working on or planning any changes to user-facing APIs,
this is a great opportunity to get community feedback.

Cheers,
Greg


This Month in Mesos: August 2018

2018-08-15 Thread Greg Mann
Hi all,
My apologies for the lack of emails during the last few months - I'm going
to try to get back into the routine! Here's your August update on recent
developments in the Mesos community, organized by working group:

Containerization
This has continued to be an area of active development, with the following
features recently merged:

   - Automatic image garbage collection for Mesos containerizer
   - HDFS fetching of Docker images in Mesos containerizer
   - Auto cgroup support
   - Container cgroup FS mounts
   - Many bug fixes!

Find more info in the agenda/notes document.


Performance
Performance improvements have landed in a variety of components within the
codebase including metrics, containerization, and resource allocation:

   - Faster generation of metrics snapshots
   - Benchmark testing of containerizer performance
   - Quota-related performance improvements in the allocator
   - Parallel processing of master state requests

More information is in the agenda/notes document.


Community
The biggest news on the community front is the progress on organizing the
next MesosCon! MesosCon 2018 will be held in San Francisco from Nov. 5-7.
Talk proposals are being accepted until Aug. 27th, submit yours at
https://mesoscon2018.org/ !

We also recently moved the Mesos repository to gitbox, which allows us to
integrate better with GitHub and will hopefully enable some improvements to
our committers' tooling in the near future.

More information in the agenda/notes document.


API
Just a couple items to report here:

   - Persistent volumes can now be resized with the GROW_VOLUME and
   SHRINK_VOLUME operations
   - Per-framework metrics have been added which provide useful stats for
   every framework that registers with the master

More information in the agenda/notes document.


Operations
Many thanks to Gastón Kleiman for spearheading this new working group! The
first meeting was held recently, with the next one coming up on Aug. 28 at
9am PST.

One notable change which came out of the first meeting is the movement of
the 'mesos_exporter' metrics processing tool into the Mesos GitHub org; it
can now be found at https://github.com/mesos/mesos_exporter.

More information in the agenda/notes document.


Mesos 1.7.0
Chun-Hung and Gastón are managing the 1.7.0 release, which is just around
the corner! They're planning to cut the first release candidate on Monday,
Aug. 20th. Keep your eyes peeled for their email, and please help test and
vote!


That's it for this month, thanks for all the hard work everyone! See you at
the next working group meetings :)

Cheers,
Greg


Re: Backport Policy

2018-07-26 Thread Greg Mann
>
>>>> I like how you summarized it Greg and I would vote for leaving the
>>>> decision
>>>> to the committer too. In addition to what others mentioned, I think
>>>> committer should've the responsibility because if things break in a
>>>> point
>>>> release (after it is released), it is the committer and contributor who
>>>> are
>>>> on the hook to triage and fix it and not the release manager.
>>>>
>>>> Having said that, if "during" the release process (i.e., cutting an RC)
>>>> these backports cause delays for a release manager in getting the
>>>> release
>>>> out (e.g., CI flakiness introduced due to backports), release manager
>>>> could
>>>> be the ultimate arbiter on whether such a backport should be reverted or
>>>> fixed by the committer/contributor. Hopefully such issues are caught
>>>> much
>>>> before a release process is started (e.g., CI running against release
>>>> branches).
>>>>
>>>> On Mon, Jul 16, 2018 at 1:28 PM Jie Yu  wrote:
>>>>
>>>> > Greg, I like your idea of adding a prescriptive "policy" when
>>>> evaluating
>>>> > whether a bug fix should be backported, and leave the decision to
>>>> committer
>>>> > (because they have the most context, and avoid a bottleneck in the
>>>> > process).
>>>> >
>>>> > - Jie
>>>> >
>>>> > On Mon, Jul 16, 2018 at 11:24 AM, Greg Mann 
>>>> wrote:
>>>> >
>>>> > > My impression is that we have two opposing schools of thought here:
>>>> > >
>>>> > >1. Backport as little as possible, to avoid unforeseen
>>>> consequences
>>>> > >2. Backport as much as proves practical, to eliminate bugs in
>>>> > >supported versions
>>>> > >
>>>> > > Do other people agree with this assessment?
>>>> > >
>>>> > > If so, how can we find common ground? One possible solution would
>>>> be to
>>>> > > leave the decision on backporting up to the committer, without
>>>> > specifying a
>>>> > > project-wide policy. This seems to be the status quo, and would
>>>> lead to
>>>> > > some variation across committers regarding what types of fixes are
>>>> > > backported. We could also choose to delegate the decision to the
>>>> release
>>>> > > manager; I favor leaving the decision with the committer, to
>>>> eliminate
>>>> > the
>>>> > > burden on release managers.
>>>> > >
>>>> > > Here's a thought: rather than defining a prescriptive "policy" that
>>>> we
>>>> > > expect committers to abide by, we could enumerate in the
>>>> documentation
>>>> > the
>>>> > > competing concerns that we expect committers to consider when making
>>>> > > decisions on backports. The committing docs could read something
>>>> like:
>>>> > >
>>>> > > "When bug fixes are committed to master, the committer should
>>>> evaluate
>>>> > the
>>>> > > fix to determine whether or not it should be backported to supported
>>>> > > versions. This is left to the committer, but they are expected to
>>>> weigh
>>>> > the
>>>> > > following concerns when making the decision:
>>>> > >
>>>> > >- Every backported change comes with a risk of unintended
>>>> > >consequences. The change should be carefully evaluated to ensure
>>>> that
>>>> > such
>>>> > >side-effects are highly unlikely.
>>>> > >- As the complexity of applying a backport increases due to merge
>>>> > >conflicts, the likelihood of unintended consequences also
>>>> increases.
>>>> > Bug
>>>> > >fixes which require extensive rebasing should only be backported
>>>> when
>>>> > the
>>>> > >bug is critical enough to warrant the risk.
>>>> > >- Users of supported versions benefit greatly from the
>>>> resolution of
>>>> > >bugs in point releases. Thus, wheneve

[RESULT][VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-25 Thread Greg Mann
Hi all,

The vote for Mesos 1.6.1 (rc2) has passed with the
following votes:

+1 (Binding)
--
Chun-Hung Hsiao
Vinod Kone
Gastón Kleiman


There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.6.1

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1

The mesos-1.6.1.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Greg


Re: [VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-25 Thread Greg Mann
Sorry for the delay!! Release email is forthcoming.

On Wed, Jul 25, 2018 at 2:40 AM, Stephan Erb 
wrote:

> The vote for 1.6.1 appears to have passed. Any chance we can get this
> released soon?
>
> Thanks!
>
>
> On 19.07.18, 01:11, "Gastón Kleiman"  wrote:
>
> +1 (binding)
>
> Tested on our internal CI. All green!
> Tested on CentOS 7 and the following tests failed:
>
> [  FAILED  ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor
> [  FAILED  ] CgroupsIsolatorTest.ROOT_CGROUPS_CFS_EnableCfs
> [  FAILED  ] CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_Listen
> [  FAILED  ]
> NvidiaGpuTest.ROOT_INTERNET_CURL_CGROUPS_NVIDIA_GPU_NvidiaDockerImage
> [  FAILED  ]
> bool/UserContainerLoggerTest.ROOT_LOGROTATE_RotateWithSwitchUserTrueOrFalse/0,
> where GetParam() = true
>
> They are all known to be flaky.
>
> On Wed, Jul 11, 2018 at 6:15 PM Greg Mann  wrote:
>
> > Hi all,
> >
> > Please vote on releasing the following candidate as Apache Mesos 1.6.1.
> >
> >
> > 1.6.1 includes the following:
> >
> > 
> 
> > *Announce major features here*
> > *Announce major bug fixes here*
> >
> > The CHANGELOG for the release is available at:
> >
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2
> >
> > 
> 
> >
> > The candidate for Mesos 1.6.1 release is available at:
> > https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz
> >
> > The tag to be voted on is 1.6.1-rc2:
> > https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2
> >
> > The SHA512 checksum of the tarball can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512
> >
> > The signature of the tarball can be found at:
> >
> > https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc
> >
> > The PGP key used to sign the release is here:
> > https://dist.apache.org/repos/dist/release/mesos/KEYS
> >
> > The JAR is in a staging repository here:
> > https://repository.apache.org/content/repositories/orgapachemesos-1230
> >
> > Please vote on releasing this package as Apache Mesos 1.6.1!
> >
> > The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
> > majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Mesos 1.6.1
> > [ ] -1 Do not release this package because ...
> >
> > Thanks,
> > Greg
> >
>
>
>


Re: [API WG] Meeting tomorrow, 11am PST

2018-07-24 Thread Greg Mann
Hey folks,
It looks like we don't have an agenda for the meeting today, so let's
cancel this one.

The next API working group meeting will be on August 7; see you then!

Cheers,
Greg

On Mon, Jul 23, 2018 at 11:04 AM, Greg Mann  wrote:

> Hi all,
> We have an API working group meeting scheduled tomorrow at 11am PST. There
> are currently no items on the agenda - if you have something to discuss in
> the meeting, please add it here
> <https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit?usp=sharing>
> !
>
> Cheers,
> Greg
>


[API WG] Meeting tomorrow, 11am PST

2018-07-23 Thread Greg Mann
Hi all,
We have an API working group meeting scheduled tomorrow at 11am PST. There
are currently no items on the agenda - if you have something to discuss in
the meeting, please add it here!

Cheers,
Greg


Re: [VOTE] Move the project repos to gitbox

2018-07-17 Thread Greg Mann
+1

On Tue, Jul 17, 2018 at 9:39 AM, Jie Yu  wrote:

> +1
>
> On Tue, Jul 17, 2018 at 9:38 AM, Andrew Schwartzmeyer <
> and...@schwartzmeyer.com> wrote:
>
>> +1
>>
>>
>>
>> On 07/17/2018 8:54 am, Zhitao Li wrote:
>>
>> +1
>>
>> On Tue, Jul 17, 2018 at 8:10 AM James Peach  wrote:
>>
>>>
>>>
>>> > On Jul 17, 2018, at 7:58 AM, Vinod Kone  wrote:
>>> >
>>> > Hi,
>>> >
>>> > As discussed in another thread and in the committers sync, there seem
>>> to be heavy interest in moving our project repos ("mesos", "mesos-site")
>>> from the "git-wip" git server to the new "gitbox" server to better avail
>>> GitHub integrations.
>>> >
>>> > Please vote +1, 0, -1 regarding the move to gitbox. The vote will
>>> close in 3 business days.
>>>
>>>
>>> +1
>>
>>
>>
>> --
>> Cheers,
>>
>> Zhitao Li
>>
>>
>


Re: Backport Policy

2018-07-16 Thread Greg Mann
My impression is that we have two opposing schools of thought here:

   1. Backport as little as possible, to avoid unforeseen consequences
   2. Backport as much as proves practical, to eliminate bugs in supported
   versions

Do other people agree with this assessment?

If so, how can we find common ground? One possible solution would be to
leave the decision on backporting up to the committer, without specifying a
project-wide policy. This seems to be the status quo, and would lead to
some variation across committers regarding what types of fixes are
backported. We could also choose to delegate the decision to the release
manager; I favor leaving the decision with the committer, to eliminate the
burden on release managers.

Here's a thought: rather than defining a prescriptive "policy" that we
expect committers to abide by, we could enumerate in the documentation the
competing concerns that we expect committers to consider when making
decisions on backports. The committing docs could read something like:

"When bug fixes are committed to master, the committer should evaluate the
fix to determine whether or not it should be backported to supported
versions. This is left to the committer, but they are expected to weigh the
following concerns when making the decision:

   - Every backported change comes with a risk of unintended consequences.
   The change should be carefully evaluated to ensure that such side-effects
   are highly unlikely.
   - As the complexity of applying a backport increases due to merge
   conflicts, the likelihood of unintended consequences also increases. Bug
   fixes which require extensive rebasing should only be backported when the
   bug is critical enough to warrant the risk.
   - Users of supported versions benefit greatly from the resolution of
   bugs in point releases. Thus, whenever concerns #1 and #2 can be allayed
   for a given bug fix, it should be backported."


Cheers,
Greg


On Mon, Jul 16, 2018 at 3:06 AM, Alex Rukletsov  wrote:

> Back porting as little as possible is the ultimate goal for me. My reasons
> are closely aligned with what Andrew wrote above.
>
> If we agree on this strategy, the next question is how to enforce it. My
> intuition is that committers will lean towards back porting their patches
> in arguable cases, because humans tend to overestimate the importance of
> their personal work. Delegating the decision in such cases to a release
> manager in my opinion will help us enforce the strategy of a minimal number
> of backports. As a bonus, the release manager will have a much better
> understanding of what's going on with the release, keyword: "more
> ownership".
>
> On Sat, Jul 14, 2018 at 12:07 AM, Andrew Schwartzmeyer <
> and...@schwartzmeyer.com> wrote:
>
>> I believe I fall somewhere between Alex and Ben.
>>
>> As for deciding what to backport or not, I lean toward Alex's view of
>> backporting as little as possible (and agree with his criteria). My
>> reasoning is that all changes can have unforeseen consequences, which I
>> believe is something to be actively avoided in already released versions.
>> The reason for backporting patches to fix regressions is the same as the
>> reason to avoid backporting as much as possible: keep behavior consistent
>> (and safe) within a release. With that as the goal of a branch in
>> maintenance mode, it makes sense to fix regressions, and make exceptions to
>> fix CVEs and other critical/blocking issues.
>>
>> As for who should decide what to backport, I lean toward Ben's view of
>> the burden being on the committer. I don't think we should add more work
>> for release managers, and I think the committer/shepherd obviously has the
>> most understanding of the context around changes proposed for backport.
>>
>> Here's an example of a recent bugfix which I backported:
>> https://reviews.apache.org/r/67587/ (for MESOS-3790)
>>
>> While normally I believe this change falls under "avoid due to unforeseen
>> consequences," I made an exception as the bug was old, circa 2015,
>> (indicating it had been an issue for others), and was causing recurring
>> failures in testing. The fix itself was very small, meaning it was easier
>> to evaluate for possible side effects, so I felt a little safer in that
>> regard. The effect of not having the fix was a fatal and undesired crash,
>> which furthermore left troublesome side effects on the system (you couldn't
>> bring the agent back up). And lastly, a dependent project (DC/OS) wanted it
>> in their next bump, which necessitated backporting to the release they were
>> pulling in.
>>
>> I think in general we should backport only as necessary, and leave it on
>> the committers to decide if backporting a particular change is necessary.
>>
>>
>> On 07/13/2018 12:54 am, Alex Rukletsov wrote:
>>
>>> This is exactly where our views differ, Ben : )
>>>
>>> Ideally, I would like a release manager to have more ownership and less
>>> manual work. In my imagination, a release manager has 

Re: Backport Policy

2018-07-13 Thread Greg Mann
It seems to me that putting the burden of deciding on backports on the
release manager would actually increase the amount of work required. Simply
cutting the release on a particular date is pretty quick - however,
examining tickets to determine whether or not a particular fix should be
backported seems like more effort?

The backport policy strikes me as a community decision, rather than an
individual decision. The community discusses (as we're doing now :) and
settles on a policy, which individual committers then execute. Putting each
release manager in charge of the backport policy for that particular
release would lead to inconsistency across releases. I'm not sure that this
is a terrible thing, but it's not a particularly good thing either.

I would propose that we establish a single backport policy that we all do
our best to execute, with an understanding that there will always be room
for exceptions in some situations.

I like the idea of backporting all bug fixes which apply relatively
cleanly. In addition, very critical bug fixes are worth backporting even
when extensive work is required to backport them.


Alex, could you elaborate on why you would like to backport as little as
possible? I would like to better understand your motivations there :)

Cheers,
Greg


On Fri, Jul 13, 2018 at 2:40 PM, Jie Yu  wrote:

> I typically backport all bug fixes that apply cleanly and carry low risk.
> It's a judgement call, but much of the time you can easily tell that the
> risk is low.
>
> I think my argument for why we want to do this is "why not". I want our
> software to have fewer bugs!
>
> Letting the release manager decide which patches to backport does not
> scale. Some release managers might even become dormant after a while.
>
> - Jie
>
> On Fri, Jul 13, 2018 at 12:54 AM, Alex Rukletsov 
> wrote:
>
>> This is exactly where our views differ, Ben : )
>>
>> Ideally, I would like a release manager to have more ownership and less
>> manual work. In my imagination, a release manager has more power and
>> control over dates, features, backports and everything that is related to
>> "their" branch. I would also like us to back port as little as possible,
>> to simplify testing and releasing patch versions.
>>
>> On Fri, Jul 13, 2018 at 1:17 AM, Benjamin Mahler 
>> wrote:
>>
>> > +user, it would probably be good to hear from users as well.
>> >
>> > Please see the original proposal as well as Alex's proposal and let us
>> > know your thoughts.
>> >
>> > To continue the discussion from where Alex left off:
>> >
>> > > Other bugs and significant improvements, e.g., performance, may be
>> > > back ported, the release manager should ideally be the one who
>> > > decides on this.
>> >
>> > I'm a little puzzled by this: why is the release manager involved? As we
>> > already document, backports occur when the bug is fixed, so this happens
>> > in the steady state of development, not at release time. The release
>> > manager only comes in at the time of the release itself, at which point
>> > all backports have already happened and the release manager handles the
>> > release process. Only blocker-level issues can stop the release, and
>> > while the release manager has a strong say, we should generally agree on
>> > what constitutes a release-blocking issue.
>> >
>> > Just to clarify my workflow, I generally backport every bug fix I commit
>> > that applies cleanly, right after I commit it to master (with the
>> > exceptions I listed below).
>> >
>> > On Thu, Jul 12, 2018 at 8:39 AM, Alex Rukletsov 
>> > wrote:
>> >
>> > > I would like to back port as little as possible. I suggest the
>> > > following criteria:
>> > >
>> > > * By default, regressions are back ported to existing release
>> > > branches. A bug is considered a regression if the functionality is
>> > > present in the previous minor or patch version and is not affected by
>> > > the bug there.
>> > >
>> > > * Critical and blocker issues, e.g., a CVE, can be back ported.
>> > >
>> > > * Other bugs and significant improvements, e.g., performance, may be
>> > > back ported, the release manager should ideally be the one who decides
>> > > on this.
>> > >
>> > > On Thu, Jul 12, 2018 at 12:25 AM, Vinod Kone wrote:
>> > >
>> > > > Ben, thanks for the clarification. I'm in agreement with the points
>> > > > you made.
>> > > >
>> > > > Once we have consensus, would you mind updating the doc?
>> > > >
>> > > > On Wed, Jul 11, 2018 at 5:15 PM Benjamin Mahler wrote:
>> > > >
>> > > > > I realized recently that we aren't all on the same page with
>> > > > > backporting. We currently only document the following:
>> > > > >
>> > > > > "Typically the fix for an issue that is affecting supported
>> > > > > releases lands on the master branch and is then backported to the
>> > > > > release branch(es). In rare cases, the fix might directly go into
>> > > > > a release branch without landing on

Re: [VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-11 Thread Greg Mann
Whoops, I forgot to include the list of changes included in this release -
sorry!

1.6.1-rc2 includes the following notable bug fixes:

  * [MESOS-3790] - ZooKeeper connection should retry on `EAI_NONAME`.
  * [MESOS-8830] - Agent gc on old slave sandboxes could empty persistent
volume data
  * [MESOS-8871] - Agent may fail to recover if the agent dies before image
store cache checkpointed.
  * [MESOS-8904] - Master crash when removing quota.
  * [MESOS-8936] - Implement a Random Sorter for offer allocations.
  * [MESOS-8945] - Master check failure due to CHECK_SOME(providerId).
  * [MESOS-8963] - Executor crash trying to print container ID.
  * [MESOS-8980] - mesos-slave can deadlock with docker pull.
  * [MESOS-8986] - `slave.available()` in the allocator is expensive and
drags down allocation performance.
  * [MESOS-8987] - Master asks agent to shutdown upon auth errors.
  * [MESOS-9002] - GCC 8.1 build failure in os::Fork::Tree.
  * [MESOS-9024] - Mesos master segfaults with stack overflow under load.
  * [MESOS-9025] - The container which joins CNI network and has checkpoint
enabled will be mistakenly destroyed by agent.

Cheers,
Greg

On Wed, Jul 11, 2018 at 6:15 PM, Greg Mann  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>
>
> 1.6.1 includes the following:
> 
> 
> *Announce major features here*
> *Announce major bug fixes here*
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2
> 
> 
>
> The candidate for Mesos 1.6.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz
>
> The tag to be voted on is 1.6.1-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1230
>
> Please vote on releasing this package as Apache Mesos 1.6.1!
>
> The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.6.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Greg
>


[VOTE] Release Apache Mesos 1.6.1 (rc2)

2018-07-11 Thread Greg Mann
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.6.1.


1.6.1 includes the following:

*Announce major features here*
*Announce major bug fixes here*

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc2


The candidate for Mesos 1.6.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz

The tag to be voted on is 1.6.1-rc2:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc2

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc2/mesos-1.6.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1230

Please vote on releasing this package as Apache Mesos 1.6.1!

The vote is open until Mon Jul 16 18:15:00 PDT 2018 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.6.1
[ ] -1 Do not release this package because ...

Thanks,
Greg
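
For voters who want to check the candidate before casting a vote, the comparison of the tarball against the published SHA512 file can be sketched as below. This is a minimal sketch and not part of the Mesos release tooling: the function name `sha512_matches` is hypothetical, and it assumes the `.sha512` file begins with the hex digest, as `sha512sum` output does. If the published file uses a different layout, the parsing line would need to change. It also does not replace checking the `.asc` signature against the KEYS file.

```python
import hashlib


def sha512_matches(tarball_bytes: bytes, checksum_text: str) -> bool:
    """Return True if the SHA512 of the tarball equals the published digest.

    Assumption: the checksum file starts with the hex digest, optionally
    followed by whitespace and a file name (sha512sum-style output).
    """
    fields = checksum_text.split()
    if not fields:
        # An empty checksum file can never match.
        return False
    expected = fields[0].lower()
    actual = hashlib.sha512(tarball_bytes).hexdigest()
    return actual == expected
```

In practice the two arguments would be the contents of the downloaded `mesos-1.6.1.tar.gz` (read in binary mode) and of `mesos-1.6.1.tar.gz.sha512`.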


Re: Normalization of metric keys

2018-07-09 Thread Greg Mann
Good idea; I think percent-encoding sounds great. Unless there are any
objections, I'll go with that approach.
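
The percent-encoding approach can be sketched as follows. This is only an illustration under stated assumptions: the helper names and the simplified key layout `master/frameworks/<encoded name>/<metric>` are hypothetical and are not the layout Mesos ended up using. The point it shows is that, once '/' and '%' in the user-supplied string are encoded, splitting the key on '/' is reliable again and decoding is unambiguous.

```python
from urllib.parse import quote, unquote


def encode_component(value: str) -> str:
    # safe="" makes quote() encode '/' as well; '%' is always encoded,
    # which is what keeps the decoding unambiguous.
    return quote(value, safe="")


def framework_metric_key(framework_name: str, metric: str) -> str:
    # Hypothetical, simplified key layout used only for this sketch.
    return f"master/frameworks/{encode_component(framework_name)}/{metric}"


def decode_framework_name(key: str) -> str:
    # The encoded name contains no '/', so it is always the third segment.
    return unquote(key.split("/")[2])
```

A metrics tool could call `decode_framework_name` on each key to recover the original framework name and expose it as a tag.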

On Fri, Jul 6, 2018 at 5:32 PM, Benjamin Mahler  wrote:

> Do we also want:
>
> 3. Has an unambiguous decoding.
>
> Replacing '/' with '#%$' means I don't know if the user actually supplied
> '#%$' or '/'. But using something like percent-encoding would have property
> 3.
>
> On Fri, Jul 6, 2018 at 10:25 AM, Greg Mann  wrote:
>
>> Thanks for the reply Ben!
>>
>> Yea I suspect the lack of normalization there was not intentional, and it
>> means that you can no longer reliably split on '/' unless you apply some
>> external controls to user input. Yep, this is bad :)
>>
>> One thing we should consider when normalizing metadata embedded in metric
>> keys (like framework name/ID) is that operators will likely want to
>> de-normalize this information in their metrics tooling. For example,
>> ideally something like the 'mesos_exporter' [1] could expose the framework
>> name/ID as tags which could be easily consumed by the cluster's metrics
>> infrastructure.
>>
>> To accommodate de-normalization, any substitutions we perform while
>> normalizing should be:
>>
>>    1. Unique - we should substitute a single, unique string for each
>>    disallowed character
>>    2. Verbose - we should substitute strings which are unlikely to
>>    appear in user input. (Examples: '#

Re: Normalization of metric keys

2018-07-06 Thread Greg Mann
Thanks for the reply Ben!

Yea I suspect the lack of normalization there was not intentional, and it
means that you can no longer reliably split on '/' unless you apply some
external controls to user input. Yep, this is bad :)

One thing we should consider when normalizing metadata embedded in metric
keys (like framework name/ID) is that operators will likely want to
de-normalize this information in their metrics tooling. For example,
ideally something like the 'mesos_exporter' [1] could expose the framework
name/ID as tags which could be easily consumed by the cluster's metrics
infrastructure.

To accommodate de-normalization, any substitutions we perform while
normalizing should be:

   1. Unique - we should substitute a single, unique string for each
   disallowed character
   2. Verbose - we should substitute strings which are unlikely to appear
   in user input. (Examples: 

Re: [VOTE] Release Apache Mesos 1.6.1 (rc1)

2018-07-03 Thread Greg Mann
Hey folks, an update on the 1.6.1-rc2 candidate: an issue surfaced after
the fix was merged for MESOS-8830, which is being addressed currently. I'll
be AFK for the next 3 days, so I'll cut 1.6.1-rc2 this coming Monday. Sorry
for the delay!

Cheers,
Greg

On Mon, Jul 2, 2018 at 12:30 PM, Greg Mann  wrote:

> Thanks for voting! Since a -1 vote was cast, I'll be cutting another
> release candidate shortly. Keep your eyes peeled for the email!
>
> Cheers,
> Greg
>
> On Fri, Jun 29, 2018 at 12:03 PM, Chun-Hung Hsiao 
> wrote:
>
>> -1 on https://issues.apache.org/jira/browse/MESOS-8830.
>>
>> This is a critical bug that would wipe out persistent data. I'm
>> backporting this to 1.4, 1.5 and 1.6.
>>
>> On Fri, Jun 29, 2018 at 9:05 AM Greg Mann  wrote:
>>
>> > The failures here are mostly command executor/default executor tests.
>> > Looking at the test output, it seems that the tasks in these tests
>> > failed to start successfully and send task status updates. I haven't
>> > seen this issue on our internal CI; I'll try to re-run the build on ASF
>> > CI and if the failures occur again, investigate why that environment is
>> > experiencing this problem.
>> >
>> > -Greg
>> >
>> > On Wed, Jun 27, 2018 at 1:58 PM, Vinod Kone wrote:
>> >
>> >> Hmm. A lot of tests failed when I ran this through ASF CI. Not sure if
>> >> all of these are known flaky tests?
>> >>
>> >> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/50/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
>> >>
>> >> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/50/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
>> >>
>> >> On Wed, Jun 27, 2018 at 11:59 AM Jie Yu wrote:
>> >>
>> >>> +1
>> >>>
>> >>> Passed on our internal CI that has the following matrix. I looked into
>> >>> the only failed test, looks to be a flaky test due to a race in the
>> >>> test.
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Jun 26, 2018 at 7:02 PM, Greg Mann wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>> >>>>
>> >>>>
>> >>>> 1.6.1 includes the following:
>> >>>>
>> >>>> *Announce major features here*
>> >>>> *Announce major bug fixes here*
>> >>>>
>> >>>> The CHANGELOG for the release is available at:
>> >>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc1
>> >>>>
>> >>>> The candidate for Mesos 1.6.1 release is available at:
>> >>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz
>> >>>>
>> >>>> The tag to be voted on is 1.6.1-rc1:
>> >>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc1
>> >>>>
>> >>>> The SHA512 checksum of the tarball can be found at:
>> >>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.sha512
>> >>>>
>> >>>> The signature of the tarball can be found at:
>> >>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.asc
>> >>>>
>> >>>> The PGP key used to sign the release is here:
>> >>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>> >>>>
>> >>>> The JAR is in a staging repository here:
>> >>>> https://repository.apache.org/content/repositories/orgapachemesos-1229
>> >>>>
>> >>>> Please vote on releasing this package as Apache Mesos 1.6.1!
>> >>>>
>> >>>> The vote is open until Fri Jun 29 18:46:28 PDT 2018 and passes if a
>> >>>> majority of at least 3 +1 PMC votes are cast.
>> >>>>
>> >>>> [ ] +1 Release this package as Apache Mesos 1.6.1
>> >>>> [ ] -1 Do not release this package because ...
>> >>>>
>> >>>> Thanks,
>> >>>> Greg
>> >>>>
>> >>>
>> >>>
>> >
>>
>
>


Normalization of metric keys

2018-07-03 Thread Greg Mann
Hi all!
I'm currently working on adding a suite of new per-framework metrics to
help schedulers better debug unexpected/unwanted behavior (MESOS-8842).
One issue that has come up during this work is how we should handle strings
like the framework name or role name in metric keys, since those strings
may contain characters like '/' which already have a meaning in our metrics
interface.
I intend to place the framework name and ID in the keys for the new
per-framework metrics, delimited by a sufficiently-unique separator so that
operators can decode the name/ID in their metrics tooling. An example
per-framework metric key:

master/frameworks/###/tasks/task_running


I recently realized that we actually already allow the '/' character in
metric keys, since we include the framework principal in these keys:

frameworks//messages_received
frameworks//messages_processed

We don't disallow any characters in the principal, so anything could appear
in those keys.

*Since we don't normalize the principal in the above keys, my proposal is
that we do not normalize the framework name at all when constructing the
new per-framework metric keys.*


Let me know what you think!

Cheers,
Greg


Re: [VOTE] Release Apache Mesos 1.6.1 (rc1)

2018-06-29 Thread Greg Mann
The failures here are mostly command executor/default executor tests.
Looking at the test output, it seems that the tasks in these tests failed
to start successfully and send task status updates. I haven't seen this
issue on our internal CI; I'll try to re-run the build on ASF CI and if the
failures occur again, investigate why that environment is experiencing this
problem.

-Greg

On Wed, Jun 27, 2018 at 1:58 PM, Vinod Kone  wrote:

> Hmm. A lot of tests failed when I ran this through ASF CI. Not sure if all
> of these are known flaky tests?
>
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/50/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=centos%3A7,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
>
> https://builds.apache.org/view/M-R/view/Mesos/job/Mesos-Release/50/BUILDTOOL=autotools,COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu%3A14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)&&(!ubuntu-eu2)/console
>
> On Wed, Jun 27, 2018 at 11:59 AM Jie Yu  wrote:
>
>> +1
>>
>> Passed on our internal CI that has the following matrix. I looked into
>> the only failed test, looks to be a flaky test due to a race in the test.
>>
>>
>>
>> On Tue, Jun 26, 2018 at 7:02 PM, Greg Mann  wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.6.1.
>>>
>>>
>>> 1.6.1 includes the following:
>>> 
>>> 
>>> *Announce major features here*
>>> *Announce major bug fixes here*
>>>
>>> The CHANGELOG for the release is available at:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc1
>>> 
>>> 
>>>
>>> The candidate for Mesos 1.6.1 release is available at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz
>>>
>>> The tag to be voted on is 1.6.1-rc1:
>>> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc1
>>>
>>> The SHA512 checksum of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.sha512
>>>
>>> The signature of the tarball can be found at:
>>> https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.asc
>>>
>>> The PGP key used to sign the release is here:
>>> https://dist.apache.org/repos/dist/release/mesos/KEYS
>>>
>>> The JAR is in a staging repository here:
>>> https://repository.apache.org/content/repositories/orgapachemesos-1229
>>>
>>> Please vote on releasing this package as Apache Mesos 1.6.1!
>>>
>>> The vote is open until Fri Jun 29 18:46:28 PDT 2018 and passes if a
>>> majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Mesos 1.6.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> Thanks,
>>> Greg
>>>
>>
>>


[VOTE] Release Apache Mesos 1.6.1 (rc1)

2018-06-26 Thread Greg Mann
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.6.1.


1.6.1 includes the following:

*Announce major features here*
*Announce major bug fixes here*

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.1-rc1


The candidate for Mesos 1.6.1 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz

The tag to be voted on is 1.6.1-rc1:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.1-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.1-rc1/mesos-1.6.1.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1229

Please vote on releasing this package as Apache Mesos 1.6.1!

The vote is open until Fri Jun 29 18:46:28 PDT 2018 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.6.1
[ ] -1 Do not release this package because ...

Thanks,
Greg


Re: Proposing change to the allocatable check in the allocator

2018-06-12 Thread Greg Mann
Hi all,
We had a nice discussion about this in the API working group meeting today.
I agree that it's a good idea to do our best to make this change compatible
with future updates to the Request call and/or quota. I think it would be
beneficial to have a meeting in a few days to brainstorm some ideas; please
let me know if you would like to be included in that meeting and I will add
you to an invite!

Cheers,
Greg


On Tue, Jun 12, 2018 at 8:06 AM, Alex Rukletsov  wrote:

> Instead of the master flag, why not a master API call? This would allow
> updating the value without restarting the master.
>
> Another thought is that we should explain to operators how and when to use
> this knob. For example, if they observe a behavioural pattern A, then it
> means B is happening, and tuning the knob to C might help.
>
> On Tue, Jun 12, 2018 at 7:36 AM, Jie Yu  wrote:
>
>> I would suggest we also consider the possibility of adding per-framework
>> control on `min_allocatable_resources`.
>>
>> If we want to consider supporting a per-framework setting, we should
>> probably model this as a protobuf, rather than free-form JSON. The same
>> protobuf can be reused for the master flag, the framework API, or even
>> for supporting Resource Request in the future. Something like the following:
>>
>> message ResourceQuantityPredicate {
>>   enum Type {
>>     SCALAR_GE,
>>   }
>>   optional Type type;
>>   optional Value.Scalar scalar;
>> }
>> message ResourceRequirement {
>>   required string resource_name;
>>   oneof predicates {
>>     ResourceQuantityPredicate quantity;
>>   }
>> }
>> message ResourceRequirementList {
>>   // All requirements MUST be met.
>>   repeated ResourceRequirement requirements;
>> }
>>
>> // Resource request API.
>> message Request {
>>   repeated ResourceRequirementList accepted;
>> }
>>
>> // `allocatable()`
>> message MinimalAllocatableResources {
>>   repeated ResourceRequirementList accepted;
>> }
>>
>> On Mon, Jun 11, 2018 at 3:47 PM, Meng Zhu  wrote:
>>
>> > Hi:
>> >
>> > The allocatable
>> > > ator/mesos/hierarchical.cpp#L2471-L2479>
>> > check in the allocator (shown below) was originally introduced to help
>> > alleviate the situation where a framework receives some resources but
>> > no cpu/memory, and thus cannot launch a task.
>> >
>> >
>> > constexpr double MIN_CPUS = 0.01;
>> > constexpr Bytes MIN_MEM = Megabytes(32);
>> >
>> > bool HierarchicalAllocatorProcess::allocatable(
>> >     const Resources& resources)
>> > {
>> >   Option<double> cpus = resources.cpus();
>> >   Option<Bytes> mem = resources.mem();
>> >
>> >   return (cpus.isSome() && cpus.get() >= MIN_CPUS) ||
>> >          (mem.isSome() && mem.get() >= MIN_MEM);
>> > }
>> >
>> >
>> > Issues
>> >
>> > However, a couple of issues surrounding the check have surfaced lately.
>> >
>> > - MESOS-8935 Quota limit "chopping" can lead to cpu-only and
>> > memory-only offers.
>> >
>> > We introduced fine-grained quota allocation (MESOS-7099) in Mesos 1.5.
>> > When we allocate resources to a role, we'll "chop" the available
>> > resources of the agent up to the quota limit for the role. However, this
>> > has the unintended consequence of creating cpu-only and memory-only
>> > offers, even though there might be other agents with both cpu and memory
>> > resources available in the cluster.
>> >
>> >
>> > - MESOS-8626 The 'allocatable' check in the allocator is problematic
>> > with multi-role frameworks.
>> >
>> > Suppose roleA has reserved cpu/memory on an agent and roleB has reserved
>> > disk on the same agent. A framework under both roleA and roleB will not
>> > be able to get the reserved disk due to the allocatable check. With the
>> > introduction of resource providers, similar situations will become more
>> > common.
>> >
>> > Proposed change
>> >
>> > Instead of hardcoding a one-size-fits-all value in Mesos, we are
>> > proposing to add a new master flag, min_allocatable_resources. It
>> > specifies one or more scalar resource quantities that define the minimum
>> > allocatable resources for the allocator. The allocator will only offer
>> > resources that meet at least one of the specified quantities. The
>> > default behavior *is backward compatible*, i.e., by default the flag is
>> > set to “cpus:0.01|mem:32”.
>> >
>> > Usage
>> >
>> > The flag takes either simple text listing resource(s) delimited by a
>> > bar (|) or a JSON array of JSON-formatted resources. Note that the input
>> > should be “pure” scalar quantities, i.e., the specified resource(s)
>> > should only have the name, type (set to scalar) and scalar fields set.
>> >
>> >
>> > Examples:
>> >
>> > - To eliminate cpu-only or memory-only offers due to the quota chopping,
>> > we could set the flag to “cpus:0.01;mem:32”
>> >
>> > - To enable offering disk 
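
The proposed check can be sketched as below. This is a rough illustration of the semantics described in the quoted proposal, not the Mesos implementation: the function names are hypothetical, resources are modeled as a plain name-to-quantity mapping, and the reading that '|' separates alternatives while ';' joins quantities that must all be met is an assumption drawn from the two flag values in the thread. The comparison uses `>=`, matching the existing C++ check.

```python
from typing import Dict, List


def parse_min_allocatable(flag: str) -> List[Dict[str, float]]:
    """Parse the text form, e.g. "cpus:0.01|mem:32", into alternatives.

    Assumption: '|' separates alternative sets; ';' separates quantities
    within one set, all of which must be met.
    """
    alternatives = []
    for alternative in flag.split("|"):
        quantities = {}
        for item in alternative.split(";"):
            name, _, value = item.partition(":")
            quantities[name.strip()] = float(value)
        alternatives.append(quantities)
    return alternatives


def allocatable(resources: Dict[str, float],
                minimums: List[Dict[str, float]]) -> bool:
    # Offerable if at least one alternative set is fully satisfied.
    return any(
        all(resources.get(name, 0.0) >= quantity
            for name, quantity in alternative.items())
        for alternative in minimums
    )
```

With the default value, a cpu-only bundle still passes the check; with "cpus:0.01;mem:32", it passes only when both cpu and memory are present.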

Re: Doc-a-thon - May 24th

2018-05-17 Thread Greg Mann
Hi all,
Just a reminder about the Mesos Doc-a-thon coming up next Thursday, May 24
starting at 3pm PST! You can join in person (RSVP here) or online (link to
join). It would be great to see you there!

Cheers,
Greg


On Fri, Apr 13, 2018 at 10:23 AM, Judith Malnick 
wrote:

> Hi everyone,
>
> The next Mesos Doc-a-thon will be on May 24th from 3:00-8:00 pm Pacific
> time.
>
> You can join in person (RSVP here) or online (link to join).
>
> We'll be brainstorming project suggestions over the next few weeks, so if
> you think of any sections of documentation that need improvement, please
> note them in the agenda doc.
>
> Looking forward to seeing you on the 24th!
>
> All the best,
> Judith
> --
> Judith Malnick
> Community Manager
> 310-709-1517
>


Soliciting documentation feedback

2018-05-17 Thread Greg Mann
Hi everyone,

As part of our ongoing effort to improve the Mesos docs, we're looking for
your help. What areas of the Mesos documentation need the most improvement?
Do you have projects to suggest or mistakes to flag?

We'll be compiling this feedback into project suggestions for the May 24th
Doc-a-thon, or into Jira tickets if they need more discussion. Thanks in
advance for your honest opinions!

Cheers,
Greg


Re: [VOTE] Release Apache Mesos 1.5.1 (rc1)

2018-05-15 Thread Greg Mann
+1 (binding)

I did `sudo make check` and verified that only expected flaky tests failed.


Cheers,
Greg

On Fri, May 11, 2018 at 12:35 PM, Gilbert Song  wrote:

> Hi all,
>
> Please vote on releasing the following candidate as Apache Mesos 1.5.1.
>
> 1.5.1 includes the following:
> 
> 
> * [MESOS-1720] - Slave should send exited executor message when the
> executor is never launched.
> * [MESOS-7742] - Race conditions in IOSwitchboard: listening on unix socket
> and premature closing of the connection.
> * [MESOS-8125] - Agent should properly handle recovering an executor when
> its pid is reused.
> * [MESOS-8411] - Killing a queued task can lead to the command executor
> never terminating.
> * [MESOS-8416] - CHECK failure if trying to recover nested containers but
> the framework checkpointing is not enabled.
> * [MESOS-8468] - `LAUNCH_GROUP` failure tears down the default executor.
> * [MESOS-8488] - Docker bug can cause unkillable tasks.
> * [MESOS-8510] - URI disk profile adaptor does not consider plugin type for
> a profile.
> * [MESOS-8536] - Pending offer operations on resource provider resources
> not properly accounted for in allocator.
> * [MESOS-8550] - Bug in `Master::detected()` leads to coredump in
> `MasterZooKeeperTest.MasterInfoAddress`.
> * [MESOS-8552] - CGROUPS_ROOT_PidNamespaceForward and
> CGROUPS_ROOT_PidNamespaceBackward tests fail.
> * [MESOS-8565] - Persistent volumes are not visible in Mesos UI when
> launching a pod using default executor.
> * [MESOS-8569] - Allow newline characters when decoding base64 strings in
> stout.
> * [MESOS-8574] - Docker executor makes no progress when 'docker inspect'
> hangs.
> * [MESOS-8575] - Improve discard handling for 'Docker::stop' and
> 'Docker::pull'.
> * [MESOS-8576] - Improve discard handling of 'Docker::inspect()'.
> * [MESOS-8577] - Destroy nested container if
> `LAUNCH_NESTED_CONTAINER_SESSION` fails.
> * [MESOS-8594] - Mesos master stack overflow in libprocess socket send
> loop.
> * [MESOS-8598] - Allow empty resource provider selector in
> `UriDiskProfileAdaptor`.
> * [MESOS-8601] - Master crashes during slave reregistration after failover.
> * [MESOS-8604] - Quota headroom tracking may be incorrect in the presence
> of hierarchical reservation.
> * [MESOS-8605] - Terminal task status update will not send if 'docker
> inspect' is hung.
> * [MESOS-8619] - Docker on Windows uses `USERPROFILE` instead of `HOME` for
> credentials.
> * [MESOS-8624] - Valid tasks may be explicitly dropped by agent due to race
> conditions.
> * [MESOS-8631] - Agent should be able to start a task with every CPU on a
> Windows machine.
> * [MESOS-8641] - Event stream could send heartbeat before subscribed.
> * [MESOS-8646] - Agent should be able to resolve file names on open files.
> * [MESOS-8651] - Potential memory leaks in the `volume/sandbox_path`
> isolator.
> * [MESOS-8741] - `Add` to sequence will not run if it races with sequence
> destruction.
> * [MESOS-8742] - Agent resource provider config API calls should be
> idempotent.
> * [MESOS-8786] - CgroupIsolatorProcess accesses subsystem processes
> directly.
> * [MESOS-8787] - RP-related API should be experimental.
> * [MESOS-8876] - Normal exit of Docker container using rexray volume
> results in TASK_FAILED.
> * [MESOS-8881] - Enable epoll backend in libevent integration.
> * [MESOS-8885] - Disable libevent debug mode.
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.5.1-rc1
> 
> 
>
> The candidate for Mesos 1.5.1 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.1-rc1/mesos-1.5.1.tar.gz
>
> The tag to be voted on is 1.5.1-rc1:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.5.1-rc1
>
> The SHA512 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.1-rc1/mesos-1.5.1.tar.gz.sha512
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.5.1-rc1/mesos-1.5.1.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1224
>
> Please vote on releasing this package as Apache Mesos 1.5.1!
>
> The vote is open until Wed May 16 12:31:02 PDT 2018 and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.5.1
> [ ] -1 Do not release this package because ...
>
> Thanks,
> Gilbert
>


[RESULT][VOTE] Release Apache Mesos 1.6.0 (rc1)

2018-05-11 Thread Greg Mann
Hi all,

The vote for Mesos 1.6.0 (rc1) has passed with the
following votes.

+1 (Binding)
--
Vinod Kone
Chun-Hung Hsiao
James Peach
Zhitao Li
Andrew Schwartzmeyer


There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.6.0

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.0

The mesos-1.6.0.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks!
Greg


Re: Getting roles' info in Scheduler

2018-05-10 Thread Greg Mann
Hi Pascal,
This isn't possible directly with the SchedulerDriver, but your scheduler
could use the 'GET_ROLES' call of the operator API [1] for this purpose.

Cheers,
Greg

[1]
http://mesos.apache.org/documentation/latest/operator-http-api/#get_roles
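
For illustration, here is a minimal sketch of such a call in Python, using only the standard library. The master address is a placeholder, and the response layout assumed here (a `get_roles.roles` list whose entries carry `name` and, optionally, `weight`) should be checked against the operator API documentation [1] for your Mesos version. The helper names are hypothetical.

```python
import json
import urllib.request


def build_get_roles_request(master_url):
    """Build a POST request for the master's v1 operator API endpoint."""
    body = json.dumps({"type": "GET_ROLES"}).encode("utf-8")
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/api/v1",
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )


def role_weights(response_json):
    """Map role name -> weight from a decoded GET_ROLES response.

    Assumption: a role without an explicit weight is treated as 1.0.
    """
    roles = response_json.get("get_roles", {}).get("roles", [])
    return {role["name"]: role.get("weight", 1.0) for role in roles}


def fetch_role_weights(master_url):
    """Send the call to a live master and return the name -> weight map."""
    with urllib.request.urlopen(build_get_roles_request(master_url)) as resp:
        return role_weights(json.load(resp))
```

A Java scheduler could issue the same POST with any HTTP client; only the JSON body and the `/api/v1` path matter.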

On Sat, May 5, 2018 at 3:37 AM, Pascal Gillet 
wrote:

> Hi All,
>
> Is it possible to get all the roles, specifically weighted roles, declared
> in the Mesos master from a SchedulerDriver in Java?
>
> Thanks
>
> Pascal GILLET
>


[VOTE] Release Apache Mesos 1.6.0 (rc1)

2018-05-07 Thread Greg Mann
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.6.0.


1.6.0 includes the following:

* Resizing of persistent volumes for agent default resources
* Offer operation feedback for resource provider resources
* Docker executor/containerizer improvements for graceful handling of
Docker failures
* Support for jemalloc on Linux

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.6.0-rc1


The candidate for Mesos 1.6.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.0-rc1/mesos-1.6.0.tar.gz

The tag to be voted on is 1.6.0-rc1:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.6.0-rc1

The SHA512 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.0-rc1/mesos-1.6.0.tar.gz.sha512

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.6.0-rc1/mesos-1.6.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1223
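
For anyone testing the candidate, here is a minimal sketch of the checksum comparison in Python. It assumes the `.sha512` file begins with the hex digest, optionally followed by the file name; that layout is an assumption, not something stated above. Verifying the `.asc` signature needs GnuPG and the KEYS file and is not covered here. The function names are hypothetical.

```python
import hashlib


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(tarball_path, sha512_file_text):
    """Compare a tarball against the contents of its .sha512 file.

    Assumption: the first whitespace-separated token is the hex digest.
    """
    tokens = sha512_file_text.split()
    if not tokens:
        return False
    return sha512_of_file(tarball_path) == tokens[0].lower()
```

Usage would be along the lines of `checksum_matches("mesos-1.6.0.tar.gz", open("mesos-1.6.0.tar.gz.sha512").read())`.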

Please vote on releasing this package as Apache Mesos 1.6.0!

The vote is open until Thu May 10 20:45:34 PDT 2018 and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.6.0
[ ] -1 Do not release this package because ...
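
As a rough sketch of the passing rule stated above, the tally can be written in Python as follows. It reads "a majority of at least 3 +1 PMC votes" as: at least three binding +1 votes, and more binding +1 votes than binding -1 votes. That reading is an interpretation, and the helper is hypothetical.

```python
def vote_passes(votes, min_binding_plus_ones=3):
    """Decide whether a release vote passes.

    votes: iterable of (value, is_binding) pairs, value in {+1, 0, -1}.
    Only binding (PMC) votes count toward the result.
    """
    binding = [value for value, is_binding in votes if is_binding]
    plus_ones = sum(1 for value in binding if value == 1)
    minus_ones = sum(1 for value in binding if value == -1)
    return plus_ones >= min_binding_plus_ones and plus_ones > minus_ones
```

Non-binding votes are still welcome as a testing signal; they just do not affect the tally.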

Thanks,
Greg


UPDATE: Mesos 1.6.0 release

2018-05-04 Thread Greg Mann
Hey folks,
We're very nearly ready to cut the first release candidate for 1.6, but not
quite :) Rather than rushing it, I'd prefer to wait until Monday. I
apologize for the further delay!

Thanks to everybody for your hard work resolving blockers over the past
couple weeks!! It is much appreciated. I have created the 1.6.x branch
and *will cut Mesos 1.6 RC1 on Monday, May 7.* Keep your eyes peeled on the
mailing list for the email; your help in testing is greatly appreciated!

Thanks for your patience!

Cheers,
Greg


On Tue, Apr 24, 2018 at 3:44 PM, Greg Mann <g...@mesosphere.io> wrote:

> Hi all,
> Several contributors who are either assignees or shepherds for 1.6 blocker
> tickets (myself included) have recently been pulled into some high priority
> internal work, so unfortunately I would like to delay the cut of the first
> release candidate for Mesos 1.6.0 to ensure that we have adequate time to
> resolve all blockers without rushing code in at the last minute.
>
> I'll delay the release candidate by one week, so *now I plan to cut RC1
> on Friday, May 4*.
>
> For those of you working on getting those last 10 blocker tickets merged,
> this gives you one more week to do so. Please aim to get all 1.6 blockers
> merged by the end of Thursday, May 3.
>
> If you have concerns about the new timeline, feel free to reach out to me
> and we can see if accelerating it would be possible.
>
> Thanks everybody, and sorry for the delay!
> Cheers,
> Greg
>


[API WG] Meeting tomorrow!

2018-04-30 Thread Greg Mann
Hi all,
The API working group will meet tomorrow, May 1, from 11:00-11:50am PST.
We'll be chatting about a proposal for per-framework metrics.

Feel free to add more items to the agenda doc!
To join the meeting, use the Zoom link provided in the agenda.

Cheers,
Greg


UPDATE: Mesos 1.6.0 release

2018-04-24 Thread Greg Mann
Hi all,
Several contributors who are either assignees or shepherds for 1.6 blocker
tickets (myself included) have recently been pulled into some high priority
internal work, so unfortunately I would like to delay the cut of the first
release candidate for Mesos 1.6.0 to ensure that we have adequate time to
resolve all blockers without rushing code in at the last minute.

I'll delay the release candidate by one week, so *now I plan to cut RC1 on
Friday, May 4*.

For those of you working on getting those last 10 blocker tickets merged,
this gives you one more week to do so. Please aim to get all 1.6 blockers
merged by the end of Thursday, May 3.

If you have concerns about the new timeline, feel free to reach out to me
and we can see if accelerating it would be possible.

Thanks everybody, and sorry for the delay!
Cheers,
Greg


Re: Doc-a-thon May 24th?

2018-04-13 Thread Greg Mann
Sounds good to me - thanks Judith!!

On Wed, Apr 11, 2018 at 1:35 PM, Judith Malnick 
wrote:

> Hi All,
>
> I'd like the next Mesos Doc-a-thon to happen on May 24th from 3-8pm
> Pacific time. I picked the date because it's best for Ben H.
>
> Does anyone have major reasons why this wouldn't work? If not I'll put it
> on the calendar and start setting it up.
>
> Best,
> Judith
> --
> Judith Malnick
> Community Manager
> 310-709-1517
>


Proposal: Constrained upgrades from Mesos 1.6

2018-04-10 Thread Greg Mann
Hi all,
We are currently working on patches to implement the new GROW_VOLUME and
SHRINK_VOLUME operations [1]. In order to make it into Mesos 1.6, we're
pursuing a workaround which affects the way these operations are accounted
for in the Mesos master. These operations will be marked as *experimental* in
Mesos 1.6.

As a result of this workaround, upgrades from Mesos 1.6 to later versions
would be affected. Specifically, 1.6 masters would not be able to properly
account for the resources of failed GROW/SHRINK operations on 1.7+ agents.
This means that when upgrading from Mesos 1.6, if GROW_VOLUME or
SHRINK_VOLUME operations are being used during the upgrade, the masters
*must* be upgraded first. If we follow this proposal, this constraint would
be clearly spelled out in our upgrade documentation.
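
To make the constraint concrete, here is a small Python sketch of a pre-upgrade check. The function is hypothetical and not part of Mesos. It encodes only the one combination described above: a 1.6 master together with a 1.7+ agent while GROW_VOLUME or SHRINK_VOLUME is in use.

```python
def parse_version(version):
    """Turn a version string like '1.6.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def upgrade_step_allowed(master_version, agent_version, uses_volume_resize):
    """Return False for the master/agent mix the proposal rules out.

    The problematic mix is a 1.6 master with a 1.7+ agent while
    GROW_VOLUME or SHRINK_VOLUME operations are in use.
    """
    if not uses_volume_resize:
        return True
    master = parse_version(master_version)[:2]
    agent = parse_version(agent_version)[:2]
    return not (master == (1, 6) and agent >= (1, 7))
```

With masters upgraded first, the check never fails: `upgrade_step_allowed("1.7.0", "1.6.0", True)` is allowed, while `upgrade_step_allowed("1.6.0", "1.7.0", True)` is not.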

Since, in general, we guarantee compatibility between Mesos masters and
agents of the same major version, we wanted to check with the community to
see if this constraint on 1.6 upgrades would be acceptable. Please let us
know what you think!

Cheers,
Greg


[1] https://issues.apache.org/jira/browse/MESOS-4965


Re: Release policy and 1.6 release schedule

2018-04-10 Thread Greg Mann
Thanks for the reviews, y'all! I've got a few "Ship-Its" - I'll commit this
later today unless I hear any objections.

Cheers,
Greg

On Wed, Apr 4, 2018 at 11:49 AM, Greg Mann <g...@mesosphere.io> wrote:

> Hey folks,
> I've posted a proposed update to our documented release schedule:
> https://reviews.apache.org/r/66454/
>
> Please take a look and comment!
>
> Cheers,
> Greg
>
>
> On Mon, Mar 26, 2018 at 11:34 AM, Greg Mann <g...@mesosphere.io> wrote:
>
>> +1 for quarterly. I would also say that we should support 3 releases at
>> any given time, regardless of the duration that implies. If there are no
>> objections, I'll submit a patch to update our docs to this effect. I think
>> that slowing down our documented cadence a bit will give us a chance to
>> faithfully adhere to our stated policy.
>>
>> Alex, I agree that releasing monthly would be great if we had better
>> automation. This is something we can work toward in the future I hope :)
>>
>> Cheers,
>> Greg
>>
>> On Mon, Mar 26, 2018 at 6:49 AM, Alex Rukletsov <a...@mesosphere.com>
>> wrote:
>>
>>> I would like us to do monthly releases and support 10 branches at a time.
>>> Ideally, releasing that often reduces the burden for the release manager,
>>> because there are less changes and less new features. However, we lack
>>> automation to support this pace: our release guide [1] is several pages
>>> long and includes quite a few non-trivial steps. It would be great to
>>> find
>>> some time (maybe during the next Mesos hackathon?) and revisit our
>>> release
>>> procedures, but until then I'm +1 for quarterly.
>>>
>>> [1] https://mesos.apache.org/documentation/latest/release-guide/
>>>
>>> On Sat, Mar 24, 2018 at 5:48 AM, Vinod Kone <vinodk...@gmail.com> wrote:
>>>
>>> > I’m +1 for quarterly.
>>> >
>>> > Most importantly I want us to adhere to a predictable cadence.
>>> >
>>> > Sent from my phone
>>> >
>>> > On Mar 23, 2018, at 9:21 PM, Jie Yu <yujie@gmail.com> wrote:
>>> >
>>> > It's a burden for supporting multiple releases.
>>> >
>>> > 1.2 was released March, 2017 (1 year ago), and I know that some users
>>> are
>>> > still on that version
>>> > 1.3 was released June, 2017 (9 months ago), and we're still
>>> maintaining it
>>> > (still backport patches
>>> > <https://github.com/apache/mesos/commit/064f64552624e38d5dd92660eef6f6940128c106> several
>>> > days ago, which some users asked)
>>> > 1.4 was released Sept, 2017 (6 months ago).
>>> > 1.5 was released Feb, 2018 (1 month ago).
>>> >
>>> > As you can see, users expect a release to be supported 6-9 months
>>> (e.g.,
>>> > backports are still needed for 1.3 release, which is 9 months old). If
>>> we
>>> > were to do monthly minor release, we'll probably need to maintain 6-9
>>> > release branches? That's too much of an ask for committers and
>>> maintainers.
>>> >
>>> > I also agree with folks that there're benefits doing releases more
>>> > frequently. Given the historical data, I'd suggest we do quarterly
>>> > releases, and maintain three release branches.
>>> >
>>> > - Jie
>>> >
>>> > On Fri, Mar 23, 2018 at 10:03 AM, Greg Mann <g...@mesosphere.io>
>>> wrote:
>>> >
>>> >> The best motivation I can think of for a shorter release cycle is
>>> this: if
>>> >> the release cadence is fast enough, then developers will be less
>>> likely to
>>> >> rush a feature into a release. I think this would be a real benefit,
>>> since
>>> >> rushing features in hurts stability. *However*, I'm not sure if every
>>> two
>>> >> months is fast enough to bring this benefit. I would imagine that a
>>> >> two-month wait is still long enough that people wouldn't want to wait
>>> an
>>> >> entire release cycle to land their feature. Just off the top of my
>>> head, I
>>> >> might guess that a release cadence of 1 month or shorter would be
>>> often
>>> >> enough that it would always seem reasonable for a developer to wait
>>> until
>>> >> the next release to land a feature. What do y'all think?
>>> >>
>>> >> Other motivati

Re: Release policy and 1.6 release schedule

2018-04-04 Thread Greg Mann
Hey folks,
I've posted a proposed update to our documented release schedule:
https://reviews.apache.org/r/66454/

Please take a look and comment!

Cheers,
Greg


On Mon, Mar 26, 2018 at 11:34 AM, Greg Mann <g...@mesosphere.io> wrote:

> +1 for quarterly. I would also say that we should support 3 releases at
> any given time, regardless of the duration that implies. If there are no
> objections, I'll submit a patch to update our docs to this effect. I think
> that slowing down our documented cadence a bit will give us a chance to
> faithfully adhere to our stated policy.
>
> Alex, I agree that releasing monthly would be great if we had better
> automation. This is something we can work toward in the future I hope :)
>
> Cheers,
> Greg
>
> On Mon, Mar 26, 2018 at 6:49 AM, Alex Rukletsov <a...@mesosphere.com>
> wrote:
>
>> I would like us to do monthly releases and support 10 branches at a time.
>> Ideally, releasing that often reduces the burden for the release manager,
>> because there are less changes and less new features. However, we lack
>> automation to support this pace: our release guide [1] is several pages
>> long and includes quite a few non-trivial steps. It would be great to find
>> some time (maybe during the next Mesos hackathon?) and revisit our release
>> procedures, but until then I'm +1 for quarterly.
>>
>> [1] https://mesos.apache.org/documentation/latest/release-guide/
>>
>> On Sat, Mar 24, 2018 at 5:48 AM, Vinod Kone <vinodk...@gmail.com> wrote:
>>
>> > I’m +1 for quarterly.
>> >
>> > Most importantly I want us to adhere to a predictable cadence.
>> >
>> > Sent from my phone
>> >
>> > On Mar 23, 2018, at 9:21 PM, Jie Yu <yujie@gmail.com> wrote:
>> >
>> > It's a burden for supporting multiple releases.
>> >
>> > 1.2 was released March, 2017 (1 year ago), and I know that some users
>> are
>> > still on that version
>> > 1.3 was released June, 2017 (9 months ago), and we're still maintaining
>> it
>> > (still backport patches
>> > <https://github.com/apache/mesos/commit/064f64552624e38d5dd92660eef6f6940128c106> several
>> > days ago, which some users asked)
>> > 1.4 was released Sept, 2017 (6 months ago).
>> > 1.5 was released Feb, 2018 (1 month ago).
>> >
>> > As you can see, users expect a release to be supported 6-9 months (e.g.,
>> > backports are still needed for 1.3 release, which is 9 months old). If
>> we
>> > were to do monthly minor release, we'll probably need to maintain 6-9
>> > release branches? That's too much of an ask for committers and
>> maintainers.
>> >
>> > I also agree with folks that there're benefits doing releases more
>> > frequently. Given the historical data, I'd suggest we do quarterly
>> > releases, and maintain three release branches.
>> >
>> > - Jie
>> >
>> > On Fri, Mar 23, 2018 at 10:03 AM, Greg Mann <g...@mesosphere.io> wrote:
>> >
>> >> The best motivation I can think of for a shorter release cycle is
>> this: if
>> >> the release cadence is fast enough, then developers will be less
>> likely to
>> >> rush a feature into a release. I think this would be a real benefit,
>> since
>> >> rushing features in hurts stability. *However*, I'm not sure if every
>> two
>> >> months is fast enough to bring this benefit. I would imagine that a
>> >> two-month wait is still long enough that people wouldn't want to wait
>> an
>> >> entire release cycle to land their feature. Just off the top of my
>> head, I
>> >> might guess that a release cadence of 1 month or shorter would be often
>> >> enough that it would always seem reasonable for a developer to wait
>> until
>> >> the next release to land a feature. What do y'all think?
>> >>
>> >> Other motivating factors that have been raised are:
>> >> 1) Many users upgrade on a longer timescale than every ~2 months. I
>> think
>> >> that this doesn't need to affect our decision regarding release timing
>> -
>> >> since we guarantee compatibility of all releases with the same major
>> >> version number, there is no reason that a user needs to upgrade minor
>> >> releases one at a time. It's fine to go from 1.N to 1.(N+3), for
>> example.
>> >> 2) Backporting will be a burden if releases are too short. I think
>> that in
>> >> practice, backporting will not take to

Re: Mesos master endless attemps to kill unexisting task

2018-04-04 Thread Greg Mann
Hi Adam,
The fact that the task does not show up in the Mesos UI doesn't make sense
to me, in light of the log excerpts you included. The line:

Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441658 23602
master.cpp:5371] Telling agent 2215ab84-177b-478b-ab62-4453803fde6c-S6 at
slave(1)@10.99.50.3:5051 (zelda.service.domain.com) to kill task
pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef of
framework 346d7333-a980-43a8-93ab-343ea12d77d7- (marathon) at
scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487

indicates that the Mesos master was able to locate this task in its
internal state. So, I would expect the task to show up in the Mesos UI. You
could also look for the task in the output of the GET_TASKS operator API
call for the master
<http://mesos.apache.org/documentation/latest/operator-http-api/#get_tasks>
and the agent
<http://mesos.apache.org/documentation/latest/operator-http-api/#get_tasks-1>
.
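
As a sketch of that lookup, the Python helper below searches a decoded GET_TASKS response for a task ID. It assumes the response wraps its task lists in a `get_tasks` object and that each task carries `task_id.value` and `state`; the list names differ between master and agent, so the helper simply walks every list it finds. Please check the field names against the operator API docs for your version.

```python
def find_task(get_tasks_response, task_id):
    """Return (list_name, state) for every list in which the task appears."""
    payload = get_tasks_response.get("get_tasks", {})
    hits = []
    for list_name, tasks in payload.items():
        if not isinstance(tasks, list):
            continue
        for task in tasks:
            if task.get("task_id", {}).get("value") == task_id:
                hits.append((list_name, task.get("state")))
    return hits
```

An empty result from both the master and the agent would suggest that neither is tracking the task any longer.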

Have you looked at the Mesos agent logs to see how the agent is responding
to the KILL calls?

Mesos doesn't store any state in ZK (it's only used for leader election),
so clearing the task there is not an option. It's possible that forcing a
leader election by restarting the current Mesos master may help, but I'm
uncertain what state the master is in currently, given the inconsistency
noted above.

Cheers,
Greg


On Wed, Apr 4, 2018 at 1:09 AM, Adam Cecile <adam.cec...@hitec.lu> wrote:

> For instance,
>
> No kill ack received for instance [pub_api_oecd-rest-api-on-
> port-20015.marathon-196f414a-f61f-11e7-856c-f6e84742f1ef], retrying
> (73402 attempts so far)
>
> I'd say after 73402 attempts, it's time to let it go :D
>
> On 04/04/2018 10:07 AM, Adam Cecile wrote:
>
> Hello list !
>
> Problem is still on-going, any hint how to fix that ? Like removing broken
> app from zookeeper by hand ?
>
> Regards, Adam.
>
> On 03/20/2018 06:04 PM, daemeon reiydelle wrote:
>
> I ran across a situation with the same symptoms last year (with Mesos &
> Marathon) when we had network problems. The mesos task did exit normally
> (eventually found same in the logs), therefore the UUID had aged out.
>
>
> <==>
> "Who do you think made the first stone spear? The Asperger guy.
> If you get rid of the autism genetics, there would be no Silicon Valley"
> Temple Grandin
>
>
> *Daemeon C.M. Reiydelle San Francisco 1.415.501.0198 London 44 020 8144
> 9872*
>
>
> On Tue, Mar 20, 2018 at 1:34 AM, Adam Cecile <adam.cec...@hitec.lu> wrote:
>
>> Hi Greg,
>>
>> Yes I can confirm No kill ack received for instance
>> [pub_api_oecd-rest-api-on-port-20015.marathon-196f414a-f61f-11e7-856c-f6e84742f1ef],
>> retrying (73402 attempts so far)i cannot find this UUID in Mesos interface.
>>
>> Regards, Adam.
>>
>> On 03/15/2018 05:47 PM, Greg Mann wrote:
>>
>> Hi Adam,
>> The KILL calls are being sent to Mesos by Marathon. Since the KILL call
>> is being forwarded to the agent, it seems that the Mesos master is aware of
>> the task. Could you verify that the tasks show up as running in the Mesos
>> UI? You say that the tasks don't exist anymore - how did you verify this?
>> If the tasks show up as running in the Mesos state, but the actual task
>> processes are not running on the agent, then it could indicate an issue
>> with the Mesos agent or executor.
>>
>> Cheers,
>> Greg
>>
>>
>> On Wed, Mar 14, 2018 at 1:59 AM, Adam Cecile <adam.cec...@hitec.lu>
>> wrote:
>>
>>> Hello,
>>>
>>> I see two old tasks being stuck in Mesos. These tasks don't exist
>>> anymore since ages but Mesos still tries to kill them:
>>>
>>>
>>> Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441572 23602
>>> master.cpp:5297] Processing KILL call for task
>>> 'pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef'
>>> of framework 346d7333-a980-43a8-93ab-343ea12d77d7- (marathon) at
>>> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>>>
>>> Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441658 23602
>>> master.cpp:5371] Telling agent 2215ab84-177b-478b-ab62-4453803fde6c-S6
>>> at slave(1)@10.99.50.3:5051 (zelda.service.domain.com) to kill task
>>> pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef
>>> of framework 346d7333-a980-43a8-93ab-343ea12d77d7- (marathon) at
>>> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>>>
>>> Mar 14 09:57:09 mario mesos-master[23570]: I0314 09:57:09.441529 23607
>>> master.cpp:5297] Proces

[API WG] Meeting today

2018-04-03 Thread Greg Mann
Hi all,
The API working group will be meeting today at 11am PST. We'll be
discussing HTTP return codes in Mesos [1]. If you have any other items for
discussion, add them to the agenda! [2]

Cheers,
Greg


[1] https://issues.apache.org/jira/browse/MESOS-7697
[2] https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit#heading=h.jvt42epwk1q7e


This Month in Mesos - March 2018

2018-03-30 Thread Greg Mann
Oh hai there Apache Mesos Community!

Back again with your monthly update on current events in the Mesosverse:


*Working Groups*

Below you'll find a brief summary of the group meetings from this past
month, as well as some info about related work that's been happening in the
project. Working group meetings can be found on the Mesos community calendar,
and you should feel free to add agenda items beforehand!


*API Working Group*

[Agenda Doc]

Next Meeting: April 3 @ 11am PST

In March we held the first two meetings of the new API working group! This
has brought about a revival of our perennial discussion on the preferred
Mesos release cadence; you can expect an updated release policy in our
documentation shortly. It's looking like the new policy will be in line
with what we have been doing in practice for the last few releases, so no
big changes there.


Zhitao also presented his ongoing work on new operations which will allow
the growing/shrinking of persistent volumes. You can find his design doc
here.


*Containerization Working Group*

[Agenda Doc]

Next meeting: April 5 @ 9am PST

Two big items in the containerization space this month:


   - Improvements to the Docker containerizer/executor to more gracefully
   handle bugs in the Docker daemon: MESOS-8572
   - Configurable network namespaces for nested containers: MESOS-8534

*Community Working Group*

[Agenda Doc]

Next Meeting: April 9 @ 10:30am PST

Community working group had a preliminary discussion about the next
quarterly doc-a-thon, and discussed the possibility of spinning up a new
Releases Working Group. We also discussed plans for the next MesosCon, and
how we may want to evolve that event going forward.


*Performance Working Group*

[Agenda Doc]

Next meeting: April 18 @ 10am PST

We now have a performance dashboard which lets you view tickets in ASF JIRA
that have been marked as performance-related - take a look!


Some additional copy elimination patches have been merged, with more yet to
come. The group also discussed the near-term performance roadmap, which
includes optimization of authentication/authorization, master state
computation, and the libprocess HTTP code; see the agenda document for more
details.



Until next time,
-Greg


Re: Release policy and 1.6 release schedule

2018-03-26 Thread Greg Mann
+1 for quarterly. I would also say that we should support 3 releases at any
given time, regardless of the duration that implies. If there are no
objections, I'll submit a patch to update our docs to this effect. I think
that slowing down our documented cadence a bit will give us a chance to
faithfully adhere to our stated policy.
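
To spell out the arithmetic behind this, here is a tiny Python sketch with hypothetical helper names. It relates release interval, number of supported releases, and the support window that results. The figures plugged in are the ones from this thread.

```python
import math


def support_window_months(release_interval_months, supported_releases):
    """Months of support implied by keeping N releases maintained."""
    return release_interval_months * supported_releases


def branches_needed(release_interval_months, window_months):
    """Release branches to maintain for a given support window."""
    return math.ceil(window_months / release_interval_months)
```

Quarterly releases with three supported releases give a 9-month window, whereas monthly releases with a 6-9 month window would mean maintaining 6-9 branches.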

Alex, I agree that releasing monthly would be great if we had better
automation. This is something we can work toward in the future I hope :)

Cheers,
Greg

On Mon, Mar 26, 2018 at 6:49 AM, Alex Rukletsov <a...@mesosphere.com> wrote:

> I would like us to do monthly releases and support 10 branches at a time.
> Ideally, releasing that often reduces the burden for the release manager,
> because there are less changes and less new features. However, we lack
> automation to support this pace: our release guide [1] is several pages
> long and includes quite a few non-trivial steps. It would be great to find
> some time (maybe during the next Mesos hackathon?) and revisit our release
> procedures, but until then I'm +1 for quarterly.
>
> [1] https://mesos.apache.org/documentation/latest/release-guide/
>
> On Sat, Mar 24, 2018 at 5:48 AM, Vinod Kone <vinodk...@gmail.com> wrote:
>
> > I’m +1 for quarterly.
> >
> > Most importantly I want us to adhere to a predictable cadence.
> >
> > Sent from my phone
> >
> > On Mar 23, 2018, at 9:21 PM, Jie Yu <yujie@gmail.com> wrote:
> >
> > It's a burden for supporting multiple releases.
> >
> > 1.2 was released March, 2017 (1 year ago), and I know that some users are
> > still on that version
> > 1.3 was released June, 2017 (9 months ago), and we're still maintaining
> it
> > (still backport patches
> > <https://github.com/apache/mesos/commit/064f64552624e38d5dd92660eef6f6940128c106> several
> > days ago, which some users asked)
> > 1.4 was released Sept, 2017 (6 months ago).
> > 1.5 was released Feb, 2018 (1 month ago).
> >
> > As you can see, users expect a release to be supported 6-9 months (e.g.,
> > backports are still needed for 1.3 release, which is 9 months old). If we
> > were to do monthly minor release, we'll probably need to maintain 6-9
> > release branches? That's too much of an ask for committers and
> maintainers.
> >
> > I also agree with folks that there're benefits doing releases more
> > frequently. Given the historical data, I'd suggest we do quarterly
> > releases, and maintain three release branches.
> >
> > - Jie
> >
> > On Fri, Mar 23, 2018 at 10:03 AM, Greg Mann <g...@mesosphere.io> wrote:
> >
> >> The best motivation I can think of for a shorter release cycle is this:
> if
> >> the release cadence is fast enough, then developers will be less likely
> to
> >> rush a feature into a release. I think this would be a real benefit,
> since
> >> rushing features in hurts stability. *However*, I'm not sure if every
> two
> >> months is fast enough to bring this benefit. I would imagine that a
> >> two-month wait is still long enough that people wouldn't want to wait an
> >> entire release cycle to land their feature. Just off the top of my
> head, I
> >> might guess that a release cadence of 1 month or shorter would be often
> >> enough that it would always seem reasonable for a developer to wait
> until
> >> the next release to land a feature. What do y'all think?
> >>
> >> Other motivating factors that have been raised are:
> >> 1) Many users upgrade on a longer timescale than every ~2 months. I
> think
> >> that this doesn't need to affect our decision regarding release timing -
> >> since we guarantee compatibility of all releases with the same major
> >> version number, there is no reason that a user needs to upgrade minor
> >> releases one at a time. It's fine to go from 1.N to 1.(N+3), for
> example.
> >> 2) Backporting will be a burden if releases are too short. I think that
> in
> >> practice, backporting will not take too much longer. If there was a
> >> conflict back in the tree somewhere, then it's likely that after
> resolving
> >> that conflict once, the same diff can be used to backport the change to
> >> previous releases as well.
> >> 3) Adhering strictly to a time-based release schedule will help users
> plan
> >> their deployments, since they'll be able to rely on features being
> >> released
> >> on-schedule. However, if we do strict time-based releases, then it will
> be
> >> less certain that a particular feature will land in a particular
> release,
> &

Re: Release policy and 1.6 release schedule

2018-03-26 Thread Greg Mann
>
> I think the burden of maintaining a release branch is not just
> backporting. We need to run CI to make sure every maintained release branch
> are working, and do testing for that. It's a burden if there are too many
> release branches.
>
>
That's a good point; we do need to run CI on all supported versions.
However, I think that updates to the release branches are not nearly as
frequent as updates to the master branch. So, I think it might actually be
reasonable to run CI on ~10 release branches, since we will only need to
run it when bug fixes get backported.


Re: Release policy and 1.6 release schedule

2018-03-23 Thread Greg Mann
The best motivation I can think of for a shorter release cycle is this: if
the release cadence is fast enough, then developers will be less likely to
rush a feature into a release. I think this would be a real benefit, since
rushing features in hurts stability. *However*, I'm not sure if every two
months is fast enough to bring this benefit. I would imagine that a
two-month wait is still long enough that people wouldn't want to wait an
entire release cycle to land their feature. Just off the top of my head, I
might guess that a release cadence of 1 month or shorter would be often
enough that it would always seem reasonable for a developer to wait until
the next release to land a feature. What do y'all think?

Other motivating factors that have been raised are:
1) Many users upgrade on a longer timescale than every ~2 months. I think
that this doesn't need to affect our decision regarding release timing -
since we guarantee compatibility of all releases with the same major
version number, there is no reason that a user needs to upgrade minor
releases one at a time. It's fine to go from 1.N to 1.(N+3), for example.
2) Backporting will be a burden if releases are too short. I think that in
practice, backporting will not take too much longer. If there was a
conflict back in the tree somewhere, then it's likely that after resolving
that conflict once, the same diff can be used to backport the change to
previous releases as well.
3) Adhering strictly to a time-based release schedule will help users plan
their deployments, since they'll be able to rely on features being released
on-schedule. However, if we do strict time-based releases, then it will be
less certain that a particular feature will land in a particular release,
and users may have to wait a release cycle to get the feature.

Personally, I find the idea of preventing features from being rushed into a
release very compelling. From that perspective, I would love to see
releases every month. However, if we're not going to release that often,
then I think it does make sense to adjust our release schedule to
accommodate the features that community members want to land in a
particular release.


Jie, I'm curious why you suggest a *minimal* interval between releases.
Could you elaborate a bit on your motivations there?

Cheers,
Greg


On Fri, Mar 16, 2018 at 2:01 PM, Jie Yu <yujie@gmail.com> wrote:

> Thanks Greg for starting this thread!
>
>
>> My primary motivation here is to bring our documented policy in line
>> with our practice, whatever that may be
>
>
> +100
>
> Do people think that we should attempt to bring our release cadence more
>> in line with our current stated policy, or should the policy be changed
>> to reflect our current practice?
>
>
> I think a minor release every 2 months is probably too aggressive. I don't
> have concrete data, but my feeling is that the frequency that folks upgrade
> Mesos is low. I know that many users are still on 1.2.x.
>
> I'd actually suggest that we have a *minimal* interval between two
> releases (e.g., 3 months), and provide some buffer for the release process.
> (so we're expecting about 3 releases per year, this matches what we did
> last year).
>
> And we use our dev sync to coordinate on a release after the minimal
> release interval has elapsed (and elect a release manager).
>
> - Jie
>
> On Wed, Mar 14, 2018 at 9:51 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
>> An additional data point is how long it takes from the first RC being cut
>> until the final release tag vote passes. That probably indicates the
>> smoothness of the release process and how good the quality control measures
>> are.
>>
>> I would argue for not delaying releases for new features and for aligning
>> with the schedule declared in our policy. That makes it easier for upstream
>> projects to gauge when a feature will be ready and when they can try it out.
>>
>> On Tue, Mar 13, 2018 at 3:10 PM, Greg Mann <g...@mesosphere.io> wrote:
>>
>> > Hi folks,
>> > During the recent API working group meeting [1], we discussed the release
>> > schedule. This has been a recurring topic of discussion in the developer
>> > sync meetings, and while our official policy still specifies time-based
>> > releases at a bi-monthly cadence, in practice we tend to gate our releases
>> > on the completion of certain features, and our releases go out on a
>> > less-frequent basis. Here are the dates of our last few release blog posts,
>> > which I'm assuming correlate pretty well with the actual release dates:
>> >
>> > 1.5.0: 2/8/18
>> > 1.4.0: 9/18/17
>> > 1.3.0: 6/7/17
>> > 1.2.0: 3/8/17
>> > 1.1.0: 11/10/16
>> >
>> > Our current ca

Re: Mesos master endless attemps to kill unexisting task

2018-03-15 Thread Greg Mann
Hi Adam,
The KILL calls are being sent to Mesos by Marathon. Since the KILL call is
being forwarded to the agent, it seems that the Mesos master is aware of
the task. Could you verify that the tasks show up as running in the Mesos
UI? You say that the tasks don't exist anymore - how did you verify this?
If the tasks show up as running in the Mesos state, but the actual task
processes are not running on the agent, then it could indicate an issue
with the Mesos agent or executor.
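
One way to do the check described above without the UI is to look the task up in the JSON returned by the master's state endpoint. The following is a minimal Python sketch that works on an already-fetched state document. It assumes the document has a top-level "frameworks" list whose entries carry "tasks" and "completed_tasks" lists with "id" and "state" fields; the sample data and the helper name `find_tasks` are hypothetical.

```python
def find_tasks(state, task_id):
    """Return (framework name, list name, task state) for every matching task.

    `state` is the parsed JSON document from the master's state endpoint.
    The field names used here are assumptions about that document's layout.
    """
    matches = []
    for framework in state.get("frameworks", []):
        for list_name in ("tasks", "completed_tasks"):
            for task in framework.get(list_name, []):
                if task.get("id") == task_id:
                    matches.append(
                        (framework.get("name"), list_name, task.get("state"))
                    )
    return matches


# Hypothetical, hand-written sample standing in for a real state document.
SAMPLE_STATE = {
    "frameworks": [
        {
            "name": "marathon",
            "tasks": [{"id": "web.1", "state": "TASK_RUNNING"}],
            "completed_tasks": [{"id": "web.0", "state": "TASK_KILLED"}],
        }
    ]
}
```

If the task appears under "tasks" with a running state while no process exists on the agent, that points at the agent or executor rather than at Marathon.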

Cheers,
Greg


On Wed, Mar 14, 2018 at 1:59 AM, Adam Cecile  wrote:

> Hello,
>
> I see two old tasks stuck in Mesos. These tasks have not existed for ages,
> but Mesos still tries to kill them:
>
>
> Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441572 23602
> master.cpp:5297] Processing KILL call for task
> 'pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef'
> of framework 346d7333-a980-43a8-93ab-343ea12d77d7- (marathon) at
> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>
> Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441658 23602
> master.cpp:5371] Telling agent 2215ab84-177b-478b-ab62-4453803fde6c-S6 at
> slave(1)@10.99.50.3:5051 (zelda.service.domain.com) to kill task
> pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef
> of framework 346d7333-a980-43a8-93ab-343ea12d77d7- (marathon) at
> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>
> Mar 14 09:57:09 mario mesos-master[23570]: I0314 09:57:09.441529 23607
> master.cpp:5297] Processing KILL call for task
> 'pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef'
> of framework 346d7333-a980-43a8-93ab-343ea12d77d7- (marathon) at
> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>
> Mar 14 09:57:09 mario mesos-master[23570]: I0314 09:57:09.441617 23607
> master.cpp:5371] Telling agent 2215ab84-177b-478b-ab62-4453803fde6c-S6 at
> slave(1)@10.99.50.3:5051 (zelda.service.domain.com) to kill task
> pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef
> of framework 346d7333-a980-43a8-93ab-343ea12d77d7- (marathon) at
> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>
>
> Could you please tell me how to "purge" them from the Mesos master?
>
> Thanks in advance,
>
> Adam.
>


Release policy and 1.6 release schedule

2018-03-13 Thread Greg Mann
Hi folks,
During the recent API working group meeting [1], we discussed the release
schedule. This has been a recurring topic of discussion in the developer
sync meetings, and while our official policy still specifies time-based
releases at a bi-monthly cadence, in practice we tend to gate our releases
on the completion of certain features, and our releases go out on a
less-frequent basis. Here are the dates of our last few release blog posts,
which I'm assuming correlate pretty well with the actual release dates:

1.5.0: 2/8/18
1.4.0: 9/18/17
1.3.0: 6/7/17
1.2.0: 3/8/17
1.1.0: 11/10/16

Our current cadence seems to be around 3-4 months between releases, while
our documentation states that we release every two months [2]. My primary
motivation here is to bring our documented policy in line with our
practice, whatever that may be. Do people think that we should attempt to
bring our release cadence more in line with our current stated policy, or
should the policy be changed to reflect our current practice?
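
As a rough check of the cadence quoted above, here is a small Python sketch that computes the gap in days between consecutive releases. It uses only the blog post dates listed in this email; the variable and function names are illustrative.

```python
from datetime import date

# Release blog post dates as listed above (month/day/year in the email).
RELEASES = {
    "1.1.0": date(2016, 11, 10),
    "1.2.0": date(2017, 3, 8),
    "1.3.0": date(2017, 6, 7),
    "1.4.0": date(2017, 9, 18),
    "1.5.0": date(2018, 2, 8),
}


def release_gaps(releases):
    """Return (version, days since the previous release), oldest first."""
    ordered = sorted(releases.items(), key=lambda item: item[1])
    return [
        (newer_version, (newer_date - older_date).days)
        for (_, older_date), (newer_version, newer_date) in zip(ordered, ordered[1:])
    ]


gaps = release_gaps(RELEASES)
average_days = sum(days for _, days in gaps) / len(gaps)
```

The gaps come out to 118, 91, 103 and 143 days, an average of 113.75 days, which is roughly 3.7 months and consistent with the "3-4 months" estimate.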

If we were to attempt to align with our stated policy for 1.6.0, then we
would release around April 8, which would probably mean cutting an RC
sometime around the end of March or beginning of April. This is very soon!
:)

I'm currently working with Gastón on offer operation feedback, and I'm not
sure that we would have it ready in time for an early April release date.
Personally, I would be OK with this, since we could land the feature in
1.7.0 in June. However, I'm not sure how well this schedule would work for
the features that other people are currently working on.

I'm curious to hear people's thoughts on this, developers and users alike!

Cheers,
Greg


[1] https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit#
[2] http://mesos.apache.org/documentation/latest/versioning/#release-schedule


API Working Group - First Meeting Tomorrow

2018-03-05 Thread Greg Mann
Hello all,
We'll be having our first API working group meeting tomorrow, March 6 at
11am PST. This working group is a great opportunity for us to work toward
greater consistency and usability of our API, as well as raise issues with
the current interface and plan future directions.

You can find the agenda document here.
As a first order of business, I think we could spend some time discussing
the mission and scope of the working group, and how best to structure our
meetings to serve that mission. Feel free to add further items for
discussion to the document!

See you at the meeting :)
Greg


This Month in Mesos - February 2018

2018-02-28 Thread Greg Mann
Dear Apache Mesos Community,

Hello all! I've got a short update for you this month with recent
happenings in Mesosland:


*Working Groups*

Here's the latest from the working groups. Working group meetings can be
found on the Mesos community calendar, and you can feel free to add agenda
items beforehand!


*✨ NEW!! API Working Group ✨*

[Agenda Doc]

*First Meeting! *March 6 @ 11am PST

Join us for the first meeting of the API working group! If you plan to
attend and have an item for discussion, please add it to the agenda doc!


*Containerization Working Group*

[Agenda Doc]

Next meeting: March 8 @ 9am PST

There were discussions around providing isolation of the container root
filesystem, as well as configurable network namespaces for child containers
in a task group. JIRA issues and design docs for these items can be found
in the agenda doc.

*Community Working Group*

[Agenda Doc]

Next Meeting: March 12 @ 10:30am PST

Talked about new working groups, and brainstormed ideas for upcoming blog
posts. In addition to the new API working group, there is a storage working
group in the works.


*Performance Working Group*

[Agenda Doc]

Next meeting: March 21 @ 10am PST



*1.5.0 Release*
Woohoo! Mesos 1.5.0 is out! A big thanks goes out to Gilbert Song, who
managed this release. You can check out the release blog here. Notable
changes in this version include:

   - Container Storage Interface (CSI) support
   - v1 API performance improvements
   - Many improvements in Windows support
   - Container image garbage collection


That's all for this month. Keep your eyes peeled for some new blog content,
coming soon!!

Until next time,
-Greg


This Month in Mesos - December 2017

2017-12-12 Thread Greg Mann
Dear Apache Mesos Community,

Development in Mesos has been active lately, with work taking place to
enable things like hybrid cloud and network storage support, as well as
improvements to the scheduler API designed to make the lives of framework
developers easier.

Apache Mesos version 1.5 is just around the corner; we hope to cut the
first release candidate (RC) within the next week, so keep your eyes peeled
on the mailing lists! As always, your help testing this release during the
RC phase is greatly appreciated.

We've also scheduled our 2nd quarterly Doc-A-thon on January 11th, hosted
at Mesosphere HQ in San Francisco! You'll be able to join remotely using
Zoom, or in person. To see an event description and RSVP please visit
the Meetup page.

Last week I sat down with long-time Mesos committer Jie Yu to discuss the
storage effort that he’s been leading. Find the interview on the Mesos Blog.
And if you haven’t yet checked out Ben Mahler’s recent performance working
group progress report, you can find it there as well!

Going forward, we’ll endeavor to bring you monthly updates like this on the
latest progress in the project. Next Month, we’ll celebrate the new release
and go into detail on the exciting new features that made it in.

If you have anything you'd like to share in the newsletter (blog posts,
calls for contribution, announcements) please email me or join the
community working group, which meets every other week on Monday at 10:30 am
Pacific time using Zoom. The next meeting
will be on December 18th.

Best,

Greg


[Design Doc] An Improved KillPolicy

2017-09-25 Thread Greg Mann
Hello all!
I've been working on a little design for some improvements to the
KillPolicy. You can find the design doc here.

TL;DR: the plan is to extend the KillPolicy message to allow the initiation
step of termination to be configurable. The framework can specify that a
user-supplied signal be sent to initiate task termination, or the framework
can supply a CommandInfo which will be executed within the task's
namespaces to initiate termination.

Comments on the design doc would be greatly appreciated!

Cheers,
Greg


Re: Welcome Greg Mann as a new committer and PMC member!

2017-06-15 Thread Greg Mann
Thanks everyone! It's an honor to be part of such a group :)

Looking forward to more contributions and collaborations!!

Cheers,
Greg

On Wed, Jun 14, 2017 at 9:39 AM, Artem Harutyunyan <ar...@mesosphere.io>
wrote:

> Awesome!!! Congrats Greg, very well deserved.
>
> Artem.
>
> On Tue, Jun 13, 2017 at 2:42 PM Vinod Kone <vinodk...@apache.org> wrote:
>
> > Hi folks,
> >
> > Please welcome Greg Mann as the newest committer and PMC member of the
> > Apache Mesos project.
> >
> > Greg has been an active contributor to the Mesos project for close to 2
> > years now and has made many solid contributions. His biggest source code
> > contribution to the project has been around adding authentication support
> > for default executor. This was a major new feature that involved quite a
> > few moving parts. Additionally, he also worked on improving the scheduler
> > and executor APIs.
> >
> > Here is his more formal checklist for your perusal.
> >
> >
> > https://docs.google.com/document/d/1S6U5OFVrl7ySmpJsfD4fJ3_R8JYRRc5spV0yKrpsGBw/edit
> >
> > Thanks,
> > Vinod
> >
> >
>


Re: Welcome Gilbert Song as a new committer and PMC member!

2017-05-24 Thread Greg Mann
Congratulations Gilbert!! :D

On Wed, May 24, 2017 at 12:01 PM, Avinash Sridharan 
wrote:

> Congrats Gilbert !! Very well deserved !!
>
> On Wed, May 24, 2017 at 11:56 AM, Timothy Chen  wrote:
>
> > Congrats! Rocking the containerizer world!
> >
> > Tim
> >
> > On Wed, May 24, 2017 at 11:23 AM, Zhitao Li 
> wrote:
> > > Congrats Gilbert!
> > >
> > > On Wed, May 24, 2017 at 11:08 AM, Yan Xu  wrote:
> > >
> > >> Congrats! Well deserved!
> > >>
> > >> ---
> > >> Jiang Yan Xu  | @xujyan 
> > >>
> > >> On Wed, May 24, 2017 at 10:54 AM, Vinod Kone 
> > wrote:
> > >>
> > >>> Congrats Gilbert!
> > >>>
> > >>> On Wed, May 24, 2017 at 1:32 PM, Neil Conway 
> > >>> wrote:
> > >>>
> > >>> > Congratulations Gilbert! Well-deserved!
> > >>> >
> > >>> > Neil
> > >>> >
> > >>> > On Wed, May 24, 2017 at 10:32 AM, Jie Yu 
> > wrote:
> > >>> > > Hi folks,
> > >>> > >
> > >>> > > I'm happy to announce that the PMC has voted Gilbert Song as a new
> > >>> > > committer and member of PMC for the Apache Mesos project. Please join
> > >>> > > me to congratulate him!
> > >>> > >
> > >>> > > Gilbert has been working on the Mesos project for 1.5 years now. His
> > >>> > > main contribution is his work on the unified containerizer and nested
> > >>> > > container (aka Pod) support. He also helped a lot of folks in the
> > >>> > > community with their patches, questions, etc. He also played an
> > >>> > > important role organizing MesosCon Asia last year and this year!
> > >>> > >
> > >>> > > His formal committer checklist can be found here:
> > >>> > > https://docs.google.com/document/d/1iSiqmtdX_0CU-YgpViA6r6PU_aMCVuxuNUZ458FR7Qw/edit?usp=sharing
> > >>> > >
> > >>> > > Welcome, Gilbert!
> > >>> > >
> > >>> > > - Jie
> > >>> >
> > >>>
> > >>
> > >>
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > Zhitao Li
> >
>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245
>


Re: [VOTE] Release Apache Mesos 1.0.4 (rc1)

2017-04-25 Thread Greg Mann
+1 (non-binding)

Ran `sudo make check` on CentOS 7 with Docker 1.12.1. The only test failure
was: ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
While I haven't had a chance to look deeply into this, it seems that the
whiteout handling was not correct at the time of 1.0, and these changes
were not backported to 1.0 so the failure is not surprising:
https://issues.apache.org/jira/browse/MESOS-6360

Also successfully ran the `test-upgrade.py` script both from 0.28.3 ->
1.0.4-rc1 and from 1.0.4-rc1 -> 1.1.1

Cheers,
Greg


On Mon, Apr 24, 2017 at 3:23 PM, Vinod Kone  wrote:

> +1 (binding)
>
> Tested on ASF CI.
>
> *Revision*: 71e41f166f671c988e36c1bf04728ec3589eb509
>
>- refs/tags/1.0.4-rc1
>
> Configuration Matrix:
>
> OS            Flags                                     Build      gcc      clang
> centos:7      --verbose --enable-libevent --enable-ssl  autotools  Success  Not run
> centos:7      --verbose --enable-libevent --enable-ssl  cmake      Success  Not run
> centos:7      --verbose                                 autotools  Success  Not run
> centos:7      --verbose                                 cmake      Success  Not run
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  autotools  Success  Success
> ubuntu:14.04  --verbose --enable-libevent --enable-ssl  cmake      Success  Success
> ubuntu:14.04  --verbose                                 autotools  Success  Success
> ubuntu:14.04  --verbose                                 cmake      Success  Success
>
> On Mon, Apr 17, 2017 at 4:49 PM, Adam Bordelon  wrote:
>
>> -0, wish we could include the fix for
>> https://issues.apache.org/jira/browse/MESOS-7265 in 1.0.4, but I won't hold
>> the release for it.
>>
>> On Mon, Apr 17, 2017 at 3:44 PM, Vinod Kone  wrote:
>>
>>> Hi all,
>>>
>>> Please vote on releasing the following candidate as Apache Mesos 1.0.4.
>>>
>>>
>>> 1.0.4 includes the following:
>>>
>>> 
>>> 
>>>
>>> * [MESOS-2537] - AC_ARG_ENABLED 

Re: Welcome Kevin Klues as a Mesos Committer and PMC member!

2017-03-01 Thread Greg Mann
Woowoo! Congrats Kevin!!

On Wed, Mar 1, 2017 at 2:26 PM, Avinash Sridharan 
wrote:

> Awesome !! Congrats Kevin !!
>
> On Wed, Mar 1, 2017 at 2:07 PM, Jie Yu  wrote:
>
>> Congrats! Kevin! Well deserved!
>>
>> On Wed, Mar 1, 2017 at 2:05 PM, Benjamin Mahler 
>> wrote:
>>
>> > Hi all,
>> >
>> > Please welcome Kevin Klues as the newest committer and PMC member of the
>> > Apache Mesos project.
>> >
>> > Kevin has been an active contributor in the project for over a year, and in
>> > this time he made a number of contributions to the project: Nvidia GPU
>> > support [1], the containerization side of POD support (new container init
>> > process), and support for "attach" and "exec" of commands within running
>> > containers [2].
>> >
>> > Also, Kevin took on an effort with Haris Choudhary to revive the CLI [3]
>> > via a better structured python implementation (to be more accessible to
>> > contributors) and a more extensible architecture to better support adding
>> > new or custom subcommands. The work also adds a unit test framework for the
>> > CLI functionality (we had no tests previously!). I think it's great that
>> > Kevin took on this much needed improvement with Haris, and I'm very much
>> > looking forward to seeing this land in the project.
>> >
>> > Here is his committer eligibility document for perusal:
>> > https://docs.google.com/document/d/1mlO1yyLCoCSd85XeDKIxTYyboK_uiOJ4Uwr6ruKTlFM/edit
>> >
>> > Thanks!
>> > Ben
>> >
>> > [1] http://mesos.apache.org/documentation/latest/gpu-support/
>> > [2] https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU
>> > [3] https://docs.google.com/document/d/1r6Iv4Efu8v8IBrcUTjgYkvZ32WVscgYqrD07OyIglsA/
>> >
>>
>
>
>
> --
> Avinash Sridharan, Mesosphere
> +1 (323) 702 5245
>


Re: [VOTE] Release Apache Mesos 1.2.0 (rc2)

2017-03-01 Thread Greg Mann
I wanted to give a heads up on a flaky test failure I've encountered while
testing this RC:
'DockerRuntimeIsolatorTest.ROOT_INTERNET_CURL_DockerDefaultEntryptRegistryPuller'.
One issue related to this test was resolved recently
(https://issues.apache.org/jira/browse/MESOS-6001), but this seems to be a
separate issue (https://issues.apache.org/jira/browse/MESOS-7185). I haven't had time to
triage yet so I'm not sure if this represents a legitimate bug, but I
thought I'd email here to increase visibility while the vote is out.

Cheers,
Greg


On Fri, Feb 24, 2017 at 1:14 AM, Adam Bordelon  wrote:

> Dear Mesos developers and users,
>
> Please vote on releasing the following candidate as Apache Mesos 1.2.0.
>
> 1.2.0 includes the following:
> 
> 
>   * [MESOS-5931] - **Experimental** Support auto backend in Mesos
>     Containerizer, preferring overlayfs then aufs. Please note that the bind
>     backend needs to be specified explicitly through the agent flag
>     '--image_provisioner_backend' since it requires that the sandbox already
>     exists.
>
>   * [MESOS-6402] - **Experimental** Add rlimit support to Mesos
>     containerizer. The isolator adds support for setting POSIX resource
>     limits (rlimits) for containers launched using the Mesos containerizer.
>     POSIX rlimits can be used to control the resources a process can consume.
>     See `docs/posix_rlimits.md` for details.
>
>   * [MESOS-6419] - **Experimental** Teardown unregistered frameworks. The
>     master now treats recovered frameworks very similarly to frameworks that
>     are registered but currently disconnected. For example, recovered
>     frameworks will be reported via the normal "frameworks" key when querying
>     HTTP endpoints. This means there is no longer a concept of "orphan
>     tasks": if the master knows about a task, the task will be running under
>     a framework. Similarly, "teardown" operations on recovered frameworks
>     will now work correctly.
>
>   * [MESOS-6460] - **Experimental** Container Attach and Exec. This feature
>     adds new Agent APIs for attaching a remote client to the stdin, stdout,
>     and stderr of a running Mesos task, as well as an API for launching new
>     processes inside the same container as a running Mesos task and attaching
>     to its stdin, stdout, and stderr. At a high level, these APIs mimic
>     functionality similar to docker attach and docker exec. The primary
>     motivation for such functionality is to enable users to debug their
>     running Mesos tasks.
>
>   * [MESOS-6758] - **Experimental** Support 'Basic' auth docker private
>     registry on Mesos Containerizer. Until now, the mesos containerizer
>     always assumed Bearer auth, but we now also support basic auth for
>     private registries. Please note that the AWS ECS uses Basic authorization
>     but it does not work yet due to the redirect issue MESOS-5172.
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.2.0-rc2
> 
> 
>
> The candidate for Mesos 1.2.0 release is available at:
> https://dist.apache.org/repos/dist/dev/mesos/1.2.0-rc2/mesos-1.2.0.tar.gz
>
> The tag to be voted on is 1.2.0-rc2:
> https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.2.0-rc2
>
> The MD5 checksum of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.2.0-rc2/mesos-1.2.0.tar.gz.md5
>
> The signature of the tarball can be found at:
> https://dist.apache.org/repos/dist/dev/mesos/1.2.0-rc2/mesos-1.2.0.tar.gz.asc
>
> The PGP key used to sign the release is here:
> https://dist.apache.org/repos/dist/release/mesos/KEYS
>
> The JAR is up in Maven in a staging repository here:
> https://repository.apache.org/content/repositories/orgapachemesos-1180
>
> Please vote on releasing this package as Apache Mesos 1.2.0!
>
> The vote is open until Wed Mar 1 18:00 PST 2017 and passes if a majority of
> at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Mesos 1.2.0
> [ ] -1 Do not release this package because ...
>
> Thanks,
> -Adam-
>


Re: Welcome Neil Conway as Mesos Committer and PMC member!

2017-01-23 Thread Greg Mann
Congratulations Neil!!! :D

On Sun, Jan 22, 2017 at 4:46 PM, Neil Conway  wrote:

> Thanks for the kind words, everyone! It's been a pleasure to be a part
> of the Mesos community, and I'm looking forward to continuing to
> contribute.
>
> Neil
>
> On Sun, Jan 22, 2017 at 2:16 PM, Benjamin Mahler 
> wrote:
> > Congrats and welcome!
> >
> > On Fri, Jan 20, 2017 at 11:03 PM, Vinod Kone 
> wrote:
> >
> >> Hi folks,
> >>
> >> Please welcome Neil Conway as the newest committer and PMC member of the
> >> Apache Mesos project.
> >>
> >> Neil has been an active contributor to Mesos for more than a year now. As
> >> part of his work, he has contributed some major features (Partition aware
> >> frameworks, floating point operations for resources). Neil also took the
> >> initiative to improve the documentation of our project and shepherded
> >> several improvements over time. Doing that even without being a committer,
> >> shows that he takes ownership of the project seriously.
> >>
> >> Here is his more formal checklist for your perusal.
> >>
> >> https://docs.google.com/document/d/137MYwxEw9QCZRH09CXfn1544p1LuMuoj9LxS-sk2_F4/edit
> >>
> >> Thanks,
> >> Vinod
> >>
>


Re: Question on dynamic reservations

2017-01-17 Thread Greg Mann
Thanks Gabriel, that makes sense. It sounds like labels on static
reservations might be the most expedient path toward a solution to this
problem, but that is not without its complications, as suggested in the
related ticket which Neil filed a while back:
https://issues.apache.org/jira/browse/MESOS-4476

Povilas, also see this related ticket that Gabriel pointed me to:
https://issues.apache.org/jira/browse/MESOS-6939

It sounds like this is a real issue for stateful framework developers, so
hopefully we will find some time soon to implement a solution. In the
meantime, Povilas, I'm afraid to say I don't know exactly what solution to
recommend. If anybody else in the community has some ideas, it would be
great to hear them :)

Cheers,
Greg


On Tue, Jan 17, 2017 at 2:52 PM, Gabriel Hartmann <gabr...@mesosphere.io>
wrote:

> @Greg: The reason people use static reservation is to enforce that
> particular resources (usually disks) can only be consumed by a particular
> framework.  They also don't know when the stateful service is going to be
> installed necessarily so they don't want to race with other frameworks to
> consume those special resources.  So static reservation is desirable.
> However, all stateful services also need more information about reserved
> resources than is natively provided by Mesos in the static reservation case
> (i.e. the labels he describes).  `dcos-commons` does the same thing.
> Various workarounds exist, but none are able to provide resource
> allocation enforcement because only roles do that.  An alternate resource
> allocation enforcement mechanism is needed.  Usually this is the part where
> people start talking about quota.
>
> Neither option 1 nor option 2 provided a race proof way to get fully
> labeled reserved resources.  It's been proposed in the past that it be
> allowed to add labels to statically reserved resources.  That's kind of
> fine except now you have these things that can't really be UNRESERVEd but
> look exactly like dynamic resources which can...
>
> Quota w/ chunks as a step in the deployment of stateful services is very
> desirable in an adversarial environment.  However, if you're in a
> cooperative environment (i.e. you're not in an adversarial relationship
> with other frameworks) and you had resources (particularly disk resources)
> with attributes on them, you could have frameworks voluntarily choose not to
> consume resources not meant for them.
>
> e.g. Disk resource has attribute `CASSANDRA`.  Ok, since I'm a Kafka
> framework I won't go use that disk.
>
> On Tue, Jan 17, 2017 at 11:24 AM Greg Mann <g...@mesosphere.io> wrote:
>
>> Hi Povilas,
>> Another approach you could try is to use dynamic reservations only. You
>> could either:
>>
>>    1. Alter your stateful framework to dynamically reserve the resources
>>    that it needs, or
>>    2. Add a script to your cluster tooling that would make use of the
>>    operator endpoint for dynamic reservations [1]
>>    <http://mesos.apache.org/documentation/latest/reservation/> to
>>    dynamically reserve the stateful framework's resources when your cluster
>>    is initially provisioned. This would have a similar effect to static
>>    reservations, but would allow you to set labels
>> Approach #1 makes sense to me; is there a reason that it's not feasible
>> for your stateful framework to dynamically reserve its own resources? This
>> is the typical workflow that I would recommend. I'm not too familiar with
>> Aurora, so perhaps it's adding some complexity that I'm unaware of?
>>
>> Cheers,
>> Greg
>>
>> [1] http://mesos.apache.org/documentation/latest/reservation/
>>
>>
>> On Tue, Jan 17, 2017 at 12:28 AM, Povilas Versockas <
>> p.versoc...@gmail.com> wrote:
>>
>> Hey,
>>
>> Thanks for writing me back!
>>
>> Maybe there is some other method to solve this problem on statically
>> reserved cluster? The solution could be making agent's resources appear as
>> unreserved resources to only selected framework. I can see that mesos-agent
>> has --acls flag, so maybe tinkering with this could help me. Of course it
>> is possible to implement this in the framework scheduler, but this will add
>> way more clunkiness to the code. It feels like this kind of resource
>> management should be part of Mesos. Maybe I'm missing something?
>>
>>
>>
>> On Mon, Jan 16, 2017 at 4:58 PM, haosdent <haosd...@gmail.com> wrote:
>>
>> Hi, @Povilas It is possible to dynamically reserve unreserved resources on
>> those agents.
>>
>> On Fri, Jan 13, 2017 at 2:47 PM, Povilas Versockas <p.versoc...@gmail.com
>> >

Re: Question on dynamic reservations

2017-01-17 Thread Greg Mann
Hi Povilas,
Another approach you could try is to use dynamic reservations only. You
could either:

   1. Alter your stateful framework to dynamically reserve the resources
   that it needs, or
   2. Add a script to your cluster tooling that would make use of the
   operator endpoint for dynamic reservations [1] to dynamically reserve
   the stateful framework's resources when your cluster is initially
   provisioned. This would have a similar effect to static reservations,
   but would allow you to set labels.

Approach #1 makes sense to me; is there a reason that it's not feasible for
your stateful framework to dynamically reserve its own resources? This is
the typical workflow that I would recommend. I'm not too familiar with
Aurora, so perhaps it's adding some complexity that I'm unaware of?
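
To make approach #2 a bit more concrete, here is a minimal Python sketch of how a provisioning script might build a request for the operator reservation endpoint described in [1]. It only constructs the request and does not send it. The resource JSON layout, the `slaveId` and `resources` form fields, and the `/master/reserve` path reflect my understanding of that documentation and should be checked against [1] for your Mesos version; the function names and the `task_id` label key are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_reserve_fields(agent_id, role, principal, cpus, task_label):
    """Build the form fields for a dynamic reservation carrying a label.

    The JSON layout is an assumption based on the reservation docs;
    the "task_id" label key is a made-up convention for this sketch.
    """
    resources = [
        {
            "name": "cpus",
            "type": "SCALAR",
            "scalar": {"value": cpus},
            "role": role,
            "reservation": {
                "principal": principal,
                "labels": {"labels": [{"key": "task_id", "value": task_label}]},
            },
        }
    ]
    return {"slaveId": agent_id, "resources": json.dumps(resources)}


def encode_reserve_request(master, fields):
    """Return the (URL, form-encoded body) pair a script would POST."""
    url = "http://%s/master/reserve" % master
    return url, urlencode(fields).encode("ascii")
```

A script would POST the encoded body to the returned URL with operator credentials. Because the label travels with the reservation, the framework can later recognize which reserved offer belongs to which task.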

Cheers,
Greg

[1] http://mesos.apache.org/documentation/latest/reservation/


On Tue, Jan 17, 2017 at 12:28 AM, Povilas Versockas 
wrote:

> Hey,
>
> Thanks for writing me back!
>
> Maybe there is some other method to solve this problem on statically
> reserved cluster? The solution could be making agent's resources appear as
> unreserved resources to only selected framework. I can see that mesos-agent
> has --acls flag, so maybe tinkering with this could help me. Of course it
> is possible to implement this in the framework scheduler, but this will add
> way more clunkiness to the code. It feels like this kind of resource
> management should be part of Mesos. Maybe I'm missing something?
>
>
>
> On Mon, Jan 16, 2017 at 4:58 PM, haosdent  wrote:
>
>> Hi, @Povilas It is possible to dynamically reserve unreserved resources on
>> those agents.
>>
>> On Fri, Jan 13, 2017 at 2:47 PM, Povilas Versockas wrote:
>>
>>> Hi,
>>>
>>> Maybe someone can help me with a problem I'm having. Short version of
>>> the question is:
>>> Is it possible to use dynamic reservation on statically reserved Mesos
>>> agents?
>>>
>>> The current situation is that we have Mesos cluster which runs many
>>> frameworks (aurora, spark, cassandra) and we are developing a custom
>>> framework for stateful tasks. Our framework manages stateful tasks for many
>>> users. Currently we statically reserved our hardware which has good disks
>>> only to be used by our framework (via --resources flag on Mesos Agents).
>>>
>>> The problem we are facing is that if one stateful task fails we would
>>> like to relaunch it on the same host with the same port, cpu, disk and
>>> memory.
>>> With dynamic reservations we would put a label with task id on a
>>> reservation and on failure would just simply reuse the reserved offer.
>>> On the other hand with statically reserved Mesos agents we cannot put
>>> any labels and so we cannot distinguish offers which should have been
>>> reserved for a task and a new offer.
>>> This leaves us in the situation that if one stateful task fails and
>>> there are new stateful tasks, the new tasks can be scheduled on failed
>>> task's Mesos agent, filling it up and taking its port, cpu and memory.
>>>
>>>
>>> --
>>> Regards
>>> Povilas Versockas
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Haosdent Huang
>>
>
>
>
> --
> Pagarbiai
> Povilas Versockas
>


Re: [MESOS-6240] Allow executor/agent communication over non-TCP/IP stream socket.

2017-01-06 Thread Greg Mann
Hi Bingqiang,
I've had some recent discussions with folks about this feature; it's
something we are interested in doing but I'm not sure what its priority is
in relation to other tickets.

While the AF_UNIX address type has been added to libprocess, libprocess
does not currently accept domain socket connections. Domain sockets are
used by the I/O switchboard in the Mesos agent to communicate with running
containers, but this is mostly application-level code. Libprocess currently
exposes a single server-side TCP socket [1] for incoming connections. To
support executor communication via domain sockets, we would need to add a
new server-side domain socket to libprocess.
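
For readers unfamiliar with the mechanism, here is a generic Python illustration of a server-side Unix domain socket accepting one connection and echoing a message. This is not libprocess code and says nothing about how Mesos would implement it; the socket file name and the function name are made up for the sketch, and it assumes a POSIX system.

```python
import os
import socket
import tempfile


def echo_once_over_unix_socket(message):
    """Send a short message over a Unix domain socket and return the echo."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agent.sock")  # hypothetical socket file name
        server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            server.bind(path)      # the address is a filesystem path, not host:port
            server.listen(1)
            client.connect(path)   # queued in the listen backlog until accepted
            conn, _ = server.accept()
            with conn:
                client.sendall(message)
                received = conn.recv(len(message))  # fine for short messages
                conn.sendall(received)
                return client.recv(len(message))
        finally:
            client.close()
            server.close()
```

The point of the sketch is that the listening endpoint is addressed by a path on the local filesystem, so access can be controlled with file permissions and no TCP port is exposed.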

Are there any particular goals or use cases you have in mind for this
feature?

Cheers,
Greg

[1]
https://github.com/apache/mesos/blob/be127e6eca1312bf8c2b039646f6909fa42cd342/3rdparty/libprocess/src/process.cpp#L571



On Mon, Dec 19, 2016 at 3:54 AM, pangbingqiang 
wrote:

> Hi all:
>
>What’s the latest information about MESOS-6240
> https://issues.apache.org/jira/browse/MESOS-6240 ? Is there any demo or design
> available?
>
> I see that libprocess supports domain socket communication; do the agent and
> executor support communication by domain socket too?
>
> If there is any related information, please let me know, thanks~.
>
>
>
> Bingqiang Pang(庞兵强)
>
>
>
> Distributed and Parallel Software Lab
>
> Huawei Technologies Co., Ltd.
>
> Email:pangbingqi...@huawei.com 
>
>
>
>
>


Re: [Design Doc] [RFC] Executor Authentication

2017-01-04 Thread Greg Mann
Hello all,
I wanted to bump up this thread since it was sent just before the end of
the year. You can find in the previous email a link to the executor
authentication design doc I've been working on. Thank you to those folks
who have already chimed in with comments!

Cheers,
Greg


On Fri, Dec 23, 2016 at 7:00 PM, Greg Mann <g...@mesosphere.io> wrote:

> Hello all,
> As part of the continuing effort to secure all communication in a Mesos
> cluster, we would like to add authentication to the executor HTTP API.
> Linked below is a design document draft for this feature; I would love to
> get the community's feedback! Feel free to leave comments on the Google
> doc, as well as high-level discussion here on this thread.
>
> Considering the timing, I will leave the design doc JIRA open for comments
> for at least a couple weeks, so that people have time after the new year to
> comment. I'll send out another email on this thread in the new year.
>
> Design doc:
> https://docs.google.com/document/d/12GMJ7VGGMKsMz4JZK-2fblAJhvYlJhVUV8aF9fNh8qQ/edit?usp=sharing
>
> JIRA Epic:
> https://issues.apache.org/jira/browse/MESOS-6365
>
> Cheers,
> Greg
>


[Design Doc] [RFC] Executor Authentication

2016-12-23 Thread Greg Mann
Hello all,
As part of the continuing effort to secure all communication in a Mesos
cluster, we would like to add authentication to the executor HTTP API.
Linked below is a design document draft for this feature; I would love to
get the community's feedback! Feel free to leave comments on the Google
doc, as well as high-level discussion here on this thread.

Considering the timing, I will leave the design doc JIRA open for comments
for at least a couple weeks, so that people have time after the new year to
comment. I'll send out another email on this thread in the new year.

Design doc:
https://docs.google.com/document/d/12GMJ7VGGMKsMz4JZK-2fblAJhvYlJhVUV8aF9fNh8qQ/edit?usp=sharing

JIRA Epic:
https://issues.apache.org/jira/browse/MESOS-6365

Cheers,
Greg

