Re: [VOTE] Release Apache Aurora 0.19.0 RC0

2017-11-10 Thread Erb, Stephan
+1 from me. 

The verification script has passed. I also intended to deploy this to a test 
cluster, but won't be able to do so before the vote closes.

On 09.11.17, 17:16, "Bill Farner"  wrote:

Friendly reminder to vote, folks!  We are currently one binding vote shy of
a release, and the vote closes tomorrow!

If anyone else is getting stuck on the macOS build, a workaround is to
verify from vagrant:

$ vagrant up
$ vagrant ssh
$ cd /vagrant
$ ./build-support/release/verify-release-candidate 0.19.0-rc0



On Wed, Nov 8, 2017 at 10:48 AM, David McLaughlin 
wrote:

> +1 from me. The Mac OS breakage is disappointing, but I'm fine with it not
> being a blocker.
>
> On Tue, Nov 7, 2017 at 11:04 PM, Mohit Jaggi  wrote:
>
> > +1
> >
> > On Tue, Nov 7, 2017 at 10:51 PM, Bill Farner  wrote:
> >
> > > +1
> > >
> > > Successfully validated with ./build-support/release/
> > > verify-release-candidate
> > > 0.19.0-rc0
> > >
> > > Note that the above command fails on macOS due to AURORA-1956.
> > > However, I am still a +1 since I see mac builds as developer
> > > convenience rather than a supported environment.  Others are welcome
> > > to feel differently.
> > >
> > > On Tue, Nov 7, 2017 at 8:49 PM, Bill Farner 
> wrote:
> > >
> > > > All,
> > > >
> > > > I propose that we accept the following release candidate as the
> > official
> > > > Apache Aurora 0.19.0 release.
> > > >
> > > > Aurora 0.19.0-rc0 includes the following:
> > > > ---
> > > > The RELEASE NOTES for the release are available at:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> > > > RELEASE-NOTES.md=rel/0.19.0-rc0
> > > >
> > > > The CHANGELOG for the release is available at:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> > > > CHANGELOG=rel/0.19.0-rc0
> > > >
> > > > The tag used to create the release candidate is:
> > > > https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> > > > shortlog;h=refs/tags/rel/0.19.0-rc0
> > > >
> > > > The release candidate is available at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > apache-aurora-0.19.0-rc0.tar.gz
> > > >
> > > > The MD5 checksum of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > apache-aurora-0.19.0-rc0.tar.gz.md5
> > > >
> > > > The signature of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/0.19.0-rc0/
> > > > apache-aurora-0.19.0-rc0.tar.gz.asc
> > > >
> > > > The GPG key used to sign the release is available at:
> > > > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> > > >
> > > > Please download, verify, and test.
> > > >
> > > > The vote will close on Fri Nov 10 20:48:05 PST 2017
> > > >
> > > > [ ] +1 Release this as Apache Aurora 0.19.0
> > > > [ ] +0
> > > > [ ] -1 Do not release this as Apache Aurora 0.19.0 because...
> > > >
> > >
> >
>




Re: 0.19.0 release preparation

2017-10-30 Thread Erb, Stephan
Sounds good to me. Getting the release out quickly will allow us to remove the 
old mybatis/h2 code sooner.

I planned on upgrading to Mesos 1.4. Unfortunately, this is currently blocked by 
a missing mesos.interface package on PyPI. I sent a mail to the Mesos dev list 
but am still waiting for a response. So this will have to wait for 0.20.

On 29.10.17, 23:15, "Bill Farner"  wrote:

Folks,

I propose we cut our 0.19.0 release soon.  We have built up a respectable
set of unreleased changes.  I am
happy to perform this release, and should be able to do so as soon as this
week.

Please chime in here if there is any outstanding work that should block a
release.


Cheers,

Bill




Re: Build failed in Jenkins: Aurora #1858

2017-10-23 Thread Erb, Stephan
Ah, again a node with insufficient memory. I once added a mechanism to abort 
the build early rather than running and eventually failing in these cases. This 
was very helpful for the regular reviewbot but is not that helpful for the 
normal SCM-triggered build. 

Can anyone think of a better way to handle this case here? 
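For context, the abort mechanism boils down to a small guard at the start of the build's shell step. Here is a reconstructed sketch based on the build log below; the exact Jenkins script may differ in detail:

```shell
# Abort the build early when the node lacks ~4 GiB of free memory.
# Reconstructed from the Jenkins log; function form is for illustration.
check_memory() {
  local available_mem_k=$1          # KiB, e.g. MemAvailable from /proc/meminfo
  local threshold_mem_k=4194304     # 4 GiB expressed in KiB
  if (( threshold_mem_k > available_mem_k )); then
    echo 'Less than 4 GiB memory available. Bailing.'
    return 1
  fi
}

# On a build node one would feed it the live value, e.g.:
# check_memory "$(awk '/^MemAvailable:/{print $2}' /proc/meminfo)" || exit 1
```

Note that in the failed build below, the captured `available_mem_k` value is empty in the archived log, so the comparison trips and the build bails out as designed.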


On 23.10.17, 22:02, "Apache Jenkins Server"  wrote:

See 


Changes:

[david] Add sorting and filtering controls for TaskList

--
Started by an SCM change
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on ubuntu-4 (ubuntu trusty) in workspace 

Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://git-wip-us.apache.org/repos/asf/aurora.git
 > git init  # timeout=10
Fetching upstream changes from 
https://git-wip-us.apache.org/repos/asf/aurora.git
 > git --version # timeout=10
 > git fetch --tags --progress 
https://git-wip-us.apache.org/repos/asf/aurora.git 
+refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url 
https://git-wip-us.apache.org/repos/asf/aurora.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* 
# timeout=10
 > git config remote.origin.url 
https://git-wip-us.apache.org/repos/asf/aurora.git # timeout=10
Fetching upstream changes from 
https://git-wip-us.apache.org/repos/asf/aurora.git
 > git fetch --tags --progress 
https://git-wip-us.apache.org/repos/asf/aurora.git 
+refs/heads/*:refs/remotes/origin/*
 > git rev-parse origin/master^{commit} # timeout=10
Checking out Revision 5b91150fd0668c23b178d80516427763764ac2d3 
(origin/master)
Commit message: "Add sorting and filtering controls for TaskList"
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 5b91150fd0668c23b178d80516427763764ac2d3
 > git rev-list ec640117c273f51e26089cd83ba325be9e8a0e89 # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
[Aurora] $ /bin/bash -xe /tmp/jenkins2427407600627764864.sh
+ export HOME=
+ HOME=
++ awk '/^MemAvailable:/{print $2}' /proc/meminfo
+ available_mem_k=
+ echo

+ threshold_mem_k=4194304
+ ((  threshold_mem_k > available_mem_k  ))
+ echo 'Less than 4 GiB memory available. Bailing.'
Less than 4 GiB memory available. Bailing.
+ exit 1
Build step 'Execute shell' marked build as failure
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report files 
were found. Configuration error?





Re: Future of storage in Aurora

2017-10-02 Thread Erb, Stephan
What do you have in mind for the in-memory replacement? Revert back to the 
usage of thrift objects within plain Java containers like we do for the task 
store?

On 02.10.17, 00:59, "Bill Farner"  wrote:

I would like to revive this discussion in light of some work I have been
doing around the storage system.  The fruits of the DB storage system will
require a lot of additional effort to reach the beneficial outcomes I laid
out above, and I agree that we should cut our losses.

I plan to post patches soon that introduce non-H2 in-memory store
implementations.  *If anyone disagrees with removing the H2 implementations
as well, please chime in here.*

Disclaimer - I may propose an alternative for the persistent storage in the
near future.

On Mon, Apr 3, 2017 at 9:40 AM, Stephan Erb  wrote:

> H2 could give us fine granular data access. However, most of our code
> performs massive joins to reconstruct fully hydrated thrift objects.
> Most of the time we are then only interested in very few properties of
> those thrift structs. This applies to internal usage, but also how we
> use the API.
>
> I therefore believe we have to improve and refine our domain model in
> order to significantly improve the storage situation.
>
> I really liked Maxim's proposal from last year, and I think it is worth
> reconsidering: https://docs.google.com/document/d/1myYX3yuofGr8JIzud98x
> Xd5mqgpZ8q_RqKBpSff4-WE/edit
>
> Best regards,
> Stephan
>
> On Thu, 2017-03-30 at 15:53 -0700, David McLaughlin wrote:
> > So it sounds like before we make any decisions around removing the
> > work
> > done in H2 so far, we should figure out what is remaining to move to
> > external storage (or if it's even still a goal).
> >
> > I may still play around with reviving the in-memory stores, but will
> > separate that work from any goal to remove the H2 layer. Since it's
> > motivated by performance, I'd verify there is a benefit before
> > submitting
> > any review.
> >
> > Thanks all for the feedback.
> >
> >
> > On Thu, Mar 30, 2017 at 12:08 PM, Bill Farner  > m>
> > wrote:
> >
> > > Adding some background - there were several motivators to using SQL
> > > that
> > > come to mind:
> > > a) well-understood transaction isolation guarantees leading to a
> > > simpler
> > > programming model w.r.t. concurrency
> > > b) ability to offload storage to a separate system (e.g. Postgres)
> > > and
> > > scale it separately
> > > c) relief of computational burden of performing snapshots and
> > > backups due
> > > to (b)
> > > d) simpler code and operations model due to (b)
> > > e) schema backwards compatibility guarantees due to persistence-
> > > friendly
> > > migration-scripts
> > > f) straightforward normalization to facilitate sharing of
> > > otherwise-redundant state (I.e. TaskConfig)
> > >
> > > The storage overhaul comes with a huge caveat requiring the
> > > approach to
> > > scheduling rounds to change. I concur that the current model is
> > > hostile to
> > > offloaded storage, as ~all state must be read every scheduling
> > > round. If
> > > that cannot be worked around with lazy state or best-effort
> > > concurrency
> > > (I.e. in-memory caching), the approach is indeed flawed.
> > >
> > > On Mar 30, 2017, 10:29 AM -0700, Joshua Cohen ,
> > > wrote:
> > > > My understanding of the H2-backed stores is that at least part of
> > > > the
> > > > original rationale behind them was that they were meant to be an
> > > > interim
> > > > point on the way to external SQL-backed stores which should
> > > > theoretically
> > > > provide significant benefits w.r.t. to GC (obviously unproven,
> > > > especially
> > > > at scale).
> > > >
> > > > I don't disagree that the H2 stores themselves are problematic
> > > > (to say
> > >
> > > the
> > > > least); do we have evidence that returning to memory based stores
> > > > will be
> > > > an improvement on that?
> > > >
> > > > On Thu, Mar 30, 2017 at 12:16 PM, David McLaughlin <
> > >
> > > dmclaugh...@apache.org
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to start a discussion around storage in Aurora.
> > > > >
> > > > > I think one of the biggest mistakes we made in migrating our
> > > > > storage
> > >
> > > to H2
> > > > > was deleting the memory stores as we moved. We made a pretty
> > > > > big bet
> > >
> > > that
> > > > > we could eventually make H2/relational databases work. I don't
> > > > > think
> > >
> > > that

Re: Make drain MAX_STATUS_WAIT configurable

2017-09-05 Thread Erb, Stephan
+1

On 05.09.17, 19:17, "David McLaughlin"  wrote:

+1

On Tue, Sep 5, 2017 at 10:09 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hi folks,
>
> The aurora-admin drain command currently has a hardcoded limit of 5 
minutes
> waiting for a node to be drained, after that timeout it fails.
>
> This doesn't work very well when tasks are expected to be in KILLING state
> for more than that, for example, if the scheduler's
> transient_task_state_timeout was adjusted.
>
> Any objection on making this 5 minutes MAX_STATUS_WAIT in aurora-admin
> drain configurable?
>
> https://github.com/apache/aurora/blob/master/src/main/
> python/apache/aurora/admin/host_maintenance.py#L40
>
> Thanks,
>
> Mauricio
>
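For readers unfamiliar with the proposal, a minimal sketch of what "making MAX_STATUS_WAIT configurable" could look like. The class and attribute names mirror the thread's description of host_maintenance.py, but the real aurora-admin wiring differs (it uses twitter.common Amount/Time values and command-line flag plumbing):

```python
from datetime import timedelta

# The old hardcoded limit aurora-admin drain waits for a node to be drained.
DEFAULT_MAX_STATUS_WAIT = timedelta(minutes=5)

class HostMaintenance(object):
    """Illustrative stand-in for the real HostMaintenance helper."""

    def __init__(self, max_status_wait=None):
        # Operators who adjusted transient_task_state_timeout can pass a
        # larger value; everyone else keeps the 5-minute default.
        self.max_status_wait = max_status_wait or DEFAULT_MAX_STATUS_WAIT
```

The point of the change is simply that the constant becomes an injectable parameter, so a command-line flag can override it without touching callers that rely on the default.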




Re: Redesign of the Aurora UI

2017-07-23 Thread Erb, Stephan
A big +1 from me as well. We have not touched or updated the existing UI for 
quite some time, which is a bad sign for code health.

I would even be OK with a couple of bigger initial code dumps. I am not really 
a web developer, so a working piece of code to play around with would probably 
be the fastest way to get up to speed with the tech and its usage in Aurora. 

Thanks a lot for driving this, David!

On 21.07.17, 07:00, "Kai Huang"  wrote:

David - Sure, let's sync on the work when you are ready.



From: David McLaughlin 
Sent: Wednesday, July 19, 2017 14:10
To: dev@aurora.apache.org
Subject: Re: Redesign of the Aurora UI

Thanks for the feedback!

Joshua - I haven't tried to drop in Preact yet, but I was also planning to
throw away the prototype and starting again when it came to upstreaming it,
so as part of that we can just address incompatibilities as we go. If I was
to guess, then the only significant impact on my prototype would probably
be the reactable plugin I was using (replacement for Angular's
smart-table). But longer term I do have concerns about moving away from
what is a constantly improving and healthy ecosystem around React. So most
likely I'll hold off until a decision is made there one way or the other
(which should be within a week).


Kai - I'd be happy to coordinate and collaborate on this with others. Let
me try and finish up the CSS/UX of the pages in my prototype and from there
we can sync on who does what. Does that sound good?




On Wed, Jul 19, 2017 at 1:56 PM, Kai Huang  wrote:

> Just a few thoughts as an aurora developer and operator:
>
>
> From my experiences with Aurora users, some persisting complaints are:
>
>   1.  The current UI is not very intuitive for users to understand the
> task lifecycle and resource utilization of their jobs.
>   2.  Oftentimes users are unaware of new features/changes in the Aurora
> Scheduler/Executor, which leads to a lot of misuse of the system.
>   3.  Users have preferences on the appearance of the scheduler/thermos UI
> due to special use cases, and ask us to customize it for them (or start to
> write their own UI, which is often not recommended).
>
>
> The other major issue I see in the current UI is that it's built on an
> obsolete tech stack (AngularJS) that has all the binaries and dependencies
> in the repo. From a developer's perspective, it's a big burden to
> maintain/test the code and make fast iterations on it.
>
>
> Currently the scheduler UI is read-only, and mainly designed for debugging
> purposes. We could have done much better to make the UI more friendly to
> the end user, empower them to discover, understand and use all the Aurora
> features, and give them more insight into their jobs, or even the
> entire cluster. I love the idea of redesigning the UI with a modern stack,
> and more importantly, making every single part of the application a module
> that you can customize on your own.
>
>
> As for the development strategy, I'm in favor of the incremental approach
> that posts one page at a time. The main benefit is that we are educating
> the developers while iterating on it, and this will improve the adoption
> rate in the long term.
>
>
> Overall I'm very interested in this work, and would like to collaborate
> with David to redesign and improve the UI.
>
>
> 
> From: Joshua Cohen 
> Sent: Wednesday, July 19, 2017 11:39
> To: dev@aurora.apache.org
> Subject: Re: Redesign of the Aurora UI
>
> I think this looks great overall! I'm super excited to see the UI get some
> love (and to set us up for better iteration on the UI going forward). My
> biggest concern, of course, is the current hubbub vis-a-vis Apache and the
> BSD-3+Patents license[1]. Have you tried running this against Preact to
> confirm compatibility/performance? Also note that other Facebook libraries
> have the same license problem (e.g. Immutable, Jest), so unless FB changes
> their patent grant clause, I imagine we'd have to find alternatives to
> those as well. If only we had landed this a week ago, we could've been
> grandfathered in on the license front :(.
>
> As far as the options for landing this, I'll leave that up to more active
> reviewers, but my gut says that smaller reviews will make this easier to
> parse, especially for those unfamiliar with React. That said, perhaps we
> could go with an alternate method for reviewing here, where people review
> against your fork directly and only when they're comfortable do you 

Re: [VOTE] Release Apache Aurora 0.18.x packages

2017-07-22 Thread Erb, Stephan
+1 

The verification scripts for all distributions in the test repository have 
passed for me.


On 19.07.17, 01:58, "Santhosh Kumar Shanmugham" 
 wrote:

I missed updating one of the Bintray links.

It should be https://dl.bintray.com/shanmugh/aurora/

On Tue, Jul 18, 2017 at 3:43 PM, Santhosh Kumar Shanmugham <
sshanmug...@twitter.com> wrote:

> Kicking off the votes.
>
> +1
>
> On Tue, Jul 18, 2017 at 3:42 PM, Santhosh Kumar Shanmugham <
> sshanmug...@twitter.com> wrote:
>
>> All,
>>
>> I propose that we accept the following artifacts as the official deb and
>> rpm packaging for
>> Apache Aurora 0.18.x:
>>
>> https://dl.bintray.com/sshanmugham/aurora/
>>
>> The Aurora deb and rpm packaging includes the following:
>>
>> ---
>>
>> The branch used to create the packaging is:
>> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging
>> .git;a=tree;hb=refs/heads/0.18.x
>>
>> The packages are available at:
>> https://dl.bintray.com/shanmugh/aurora/
>>
>> The GPG keys used to sign the packages are available at:
>> https://dist.apache.org/repos/dist/release/aurora/KEYS
>>
>> Please download, verify, and test. Detailed test instructions are
>> available here:
>> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging
>> .git;a=tree;f=test;hb=refs/heads/0.18.x
>>
>>
>> The vote will close on Fri Jul 21 14:30:27 PDT 2017
>>
>> [ ] +1 Release these as the deb and rpm packages for Apache Aurora 0.18.x
>> [ ] +0
>> [ ] -1 Do not release these artifacts because...
>>
>> 
>> 
>>
>> -Santhosh
>>
>
>




Re: [DRAFT][REPORT] Apache Aurora - June 2017

2017-06-14 Thread Erb, Stephan
Thanks Jake! Looks good to me. 

On 14.06.17, 02:52, "Jake Farrell"  wrote:

 Draft board report for June 2017, feedback for any changes welcome

-Jake



Apache Aurora is a stateless and fault-tolerant service scheduler used to
schedule jobs onto Apache Mesos, such as long-running services, cron jobs,
and one-off tasks.

Project Status
-
The Apache Aurora community has been discussing different architecture
updates
to allow for dynamic reservations, rolling restarts, and
performance updates to our
storage engine while making progress on our 0.18.0 release candidate. Our
newest committer, Santhosh Shanmugham, has been shepherding the release
candidate through the process.

Community
---
Latest Additions:

* Committer addition: Santhosh Kumar Shanmugham, 2.9.2017
* PMC addition:  Mehrdad Nurolahzade, 2.24.2017

Issue backlog status since last report:

* Created:   30
* Resolved: 24

Mailing list activity since last report:

* @dev      57 messages
* @user      8 messages
* @reviews 646 messages

Releases
---
Last release: Apache Aurora 0.17.0 released 02.06.2017
Vote is currently in progress for Apache Aurora 0.18.0-RC0




Aurora 0.18 Release Planning

2017-05-22 Thread Erb, Stephan
Hi everyone,

We should aim for the next release, so that we get our new features out and 
stay current with Mesos.

Any patches you want to land before 0.18 is cut?

In addition, is anyone interested in volunteering as release manager? I can 
fill the role if necessary, but I already did so for the last release.

Best regards,
Stephan




Re: Design Doc for Mesos Maintenance in Aurora

2017-03-13 Thread Erb, Stephan
Looks good to me! 

On 08/03/2017, 02:45, "Zameer Manji"  wrote:

Hey,

I have a brief design doc describing the changes required to support Mesos
Maintenance in Aurora.

If we have consensus, I will cut a ticket and put up a patch.

-- 
Zameer Manji




Dynamic Reservations

2017-03-02 Thread Erb, Stephan
Hi everyone,

There have been two documents on Dynamic Reservations as a first step towards 
persistent services:

· RFC: 
https://docs.google.com/document/d/15n29HSQPXuFrnxZAgfVINTRP1Iv47_jfcstJNuMwr5A/edit#heading=h.hcsc8tda08vy

· Technical Design Doc:  
https://docs.google.com/document/d/1L2EKEcKKBPmuxRviSUebyuqiNwaO-2hsITBjt3SgWvE/edit#heading=h.klg3urfbnq3v

Since a couple of days there are also now two patches online for a MVP by 
Dmitriy:

· https://reviews.apache.org/r/56690/

· https://reviews.apache.org/r/56691/

From reading the documents, I am under the impression that there is a rough 
consensus on the following points:

· We want dynamic reservations. Our general goal is to enable the 
re-scheduling of tasks on the same host they used in a previous run.

· Dynamic reservations are a best-effort feature. If in doubt, a task 
will be scheduled somewhere else.

· Jobs opt into reserved resources using an appropriate tier config.

· The tier config is supposed to be neither preemptible nor revocable. 
Reserving resources therefore requires appropriate quota.

· Aurora will tag reserved Mesos resources by adding the unique 
instance key of the reserving task instance as a label. Only this task instance 
will be allowed to use those tagged resources.
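The tagging scheme in the last bullet can be illustrated with a small sketch. The label key and dict layout below are assumptions chosen for illustration, not Aurora's actual wire format (real code would build Mesos Resource protobufs):

```python
# Hypothetical helper: build a dynamically reserved resource tagged with the
# unique instance key of the reserving task instance.
def make_reserved_resource(name, amount, role, instance_key):
    return {
        'name': name,
        'scalar': {'value': amount},
        'role': role,
        'reservation': {
            'labels': {'labels': [
                # 'aurora_instance' is an assumed label key.
                {'key': 'aurora_instance', 'value': instance_key},
            ]},
        },
    }

# Only the matching task instance may consume the tagged resource.
def matches_instance(resource, instance_key):
    labels = resource['reservation']['labels']['labels']
    return any(l['key'] == 'aurora_instance' and l['value'] == instance_key
               for l in labels)
```

This captures the invariant described above: a reservation is bound to exactly one instance key, and scheduling on the reserved resources requires an exact match.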

I am unclear on the following general questions, as the documents contain 
contradictory content:

a)   How does the user interact with reservations?  There are several 
proposals in the documents to auto-reserve on `aurora job create` or `aurora 
cron schedule` and to automatically un-reserve on the appropriate reverse 
actions. But will we also allow users further control over reservations, so 
that they can manage them independently of the task/job lifecycle? For example, 
how does Borg handle this?

b)   The implementation proposal and patches include an OfferReconciler, so 
this implies we don't want to offer any control to the user. The only control 
mechanism will be the cluster-wide offer wait time limiting the number of 
seconds unused reserved resources can linger before they are un-reserved.

c)   Will we allow adhoc/cron jobs to reserve resources? Does it even 
matter if we don’t give control to users and just rely on the OfferReconciler?


I have a couple of questions on the MVP and some implementation details. I will 
follow up with those in a separate mail.

Thanks and best regards,
Stephan


Re: Build failed in Jenkins: aurora-packaging-nightly #570

2017-02-08 Thread Erb, Stephan
Caused by CentOS mirror trouble: https://bugs.centos.org/view.php?id=12793=1

On 08/02/2017, 01:38, "Apache Jenkins Server"  wrote:

See 

--
[...truncated 880 lines...]
  Verifying  : libxslt-1.1.28-5.el7.x86_64   
93/140 
  Verifying  : systemd-sysv-219-30.el7_3.6.x86_64
94/140 
  Verifying  : libICE-1.0.9-2.el7.x86_64 
95/140 
  Verifying  : subversion-devel-1.7.14-10.el7.x86_64 
96/140 
  Verifying  : libgnome-keyring-3.8.0-3.el7.x86_64   
97/140 
  Verifying  : libcurl-devel-7.29.0-35.el7.centos.x86_64 
98/140 
  Verifying  : freetype-2.4.11-12.el7.x86_64 
99/140 
  Verifying  : perl-podlators-2.5.1-3.el7.noarch
100/140 
  Verifying  : apr-util-devel-1.5.2-6.el7.x86_64
101/140 
  Verifying  : alsa-lib-1.1.1-1.el7.x86_64  
102/140 
  Verifying  : mpfr-3.1.1-4.el7.x86_64  
103/140 
  Verifying  : perl-Filter-1.49-3.el7.x86_64
104/140 
  Verifying  : dwz-0.11-3.el7.x86_64
105/140 
  Verifying  : perl-Socket-2.010-4.el7.x86_64   
106/140 
  Verifying  : cyrus-sasl-devel-2.1.26-20.el7_2.x86_64  
107/140 
  Verifying  : pakchois-0.4-10.el7.x86_64   
108/140 
  Verifying  : rsync-3.0.9-17.el7.x86_64
109/140 
  Verifying  : fipscheck-1.4.1-5.el7.x86_64 
110/140 
  Verifying  : less-458-9.el7.x86_64
111/140 
  Verifying  : gcc-c++-4.8.5-11.el7.x86_64  
112/140 
  Verifying  : initscripts-9.49.37-1.el7.x86_64 
113/140 
  Verifying  : perl-Exporter-5.68-3.el7.noarch  
114/140 
  Verifying  : perl-constant-1.27-2.el7.noarch  
115/140 
  Verifying  : perl-PathTools-3.40-5.el7.x86_64 
116/140 
  Verifying  : 1:perl-Pod-Escapes-1.04-291.el7.noarch   
117/140 
  Verifying  : expat-devel-2.1.0-10.el7_3.x86_64
118/140 
  Verifying  : libcom_err-devel-1.42.9-9.el7.x86_64 
119/140 
  Verifying  : libffi-devel-3.0.13-18.el7.x86_64
120/140 
  Verifying  : apr-1.4.8-3.el7.x86_64   
121/140 
  Verifying  : subversion-1.7.14-10.el7.x86_64  
122/140 
  Verifying  : perl-Thread-Queue-3.02-2.el7.noarch  
123/140 
  Verifying  : 1:perl-Pod-Simple-3.28-4.el7.noarch  
124/140 
  Verifying  : perl-Time-Local-1.2300-2.el7.noarch  
125/140 
  Verifying  : libdb-devel-5.3.21-19.el7.x86_64 
126/140 
  Verifying  : perl-Pod-Perldoc-3.20-4.el7.noarch   
127/140 
  Verifying  : copy-jdk-configs-1.2-1.el7.noarch
128/140 
  Verifying  : glibc-devel-2.17-157.el7_3.1.x86_64  
129/140 
  Verifying  : xorg-x11-fonts-Type1-7.5-9.el7.noarch
130/140 
  Verifying  : 1:perl-Error-0.17020-2.el7.noarch
131/140 
  Verifying  : libXfont-1.5.1-2.el7.x86_64  
132/140 
  Verifying  : flex-2.5.37-3.el7.x86_64 
133/140 
  Verifying  : cyrus-sasl-2.1.26-20.el7_2.x86_64
134/140 
  Verifying  : openssh-clients-6.6.1p1-33.el7_3.x86_64  
135/140 
  Verifying  : kernel-headers-3.10.0-514.6.1.el7.x86_64 
136/140 
  Verifying  : perl-Getopt-Long-2.40-2.el7.noarch   
137/140 
  Verifying  : zip-3.0-11.el7.x86_64
138/140 
  Verifying  : 1:java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64
139/140 
  Verifying  : libgomp-4.8.5-11.el7.x86_64  
140/140 

Installed:
  apr-devel.x86_64 0:1.4.8-3.el7

  cyrus-sasl-devel.x86_64 0:2.1.26-20.el7_2 

  flex.x86_64 0:2.5.37-3.el7

  gcc.x86_64 0:4.8.5-11.el7 

  gcc-c++.x86_64 0:4.8.5-11.el7 

  git.x86_64 0:1.8.3.1-6.el7_2.1

  

Re: Build failed in Jenkins: Aurora #1715

2017-01-22 Thread Erb, Stephan
The latest Gradle release has a couple of bugfixes. We should try an update 
and see if it helps here.

On 17/01/2017, 23:45, "John Sirois"  wrote:

So, this one again:

Exception applying rule InvalidSlf4jMessageFormat on file

/home/jenkins/jenkins-slave/workspace/Aurora/src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java,
continuing with next rule
java.lang.NullPointerException
at 
net.sourceforge.pmd.lang.java.rule.logging.InvalidSlf4jMessageFormatRule.countPlaceholders(InvalidSlf4jMessageFormatRule.java:199)
at 
net.sourceforge.pmd.lang.java.rule.logging.InvalidSlf4jMessageFormatRule.getAmountOfExpectedArguments(InvalidSlf4jMessageFormatRule.java:192)
at 
net.sourceforge.pmd.lang.java.rule.logging.InvalidSlf4jMessageFormatRule.expectedArguments(InvalidSlf4jMessageFormatRule.java:171)
at 
net.sourceforge.pmd.lang.java.rule.logging.InvalidSlf4jMessageFormatRule.visit(InvalidSlf4jMessageFormatRule.java:88)


But, more interesting further down in logs:


:testJava HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0x00077d70, 191365120, 0) failed;
error='Cannot allocate memory' (errno=12)

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 191365120 bytes for
committing reserved memory.
# An error report file with more information is saved as:
# /home/jenkins/jenkins-slave/workspace/Aurora/hs_err_pid573.log
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 236453888 bytes for
committing reserved memory.
Java HotSpot(TM) 64-Bit Server VM warning: INFO:
os::commit_memory(0x000787e8, 236453888, 0) failed;
error='Cannot allocate memory' (errno=12)
# An error report file with more information is saved as:
# /home/jenkins/jenkins-slave/workspace/Aurora/hs_err_pid577.log
Could not stop 
org.gradle.internal.actor.internal.DefaultActorFactory$NonBlockingActor@5e850067.
org.gradle.internal.dispatch.DispatchException: Could not dispatch
message [MethodInvocation method: stop()].
at 
org.gradle.internal.dispatch.ExceptionTrackingFailureHandler.dispatchFailed(ExceptionTrackingFailureHandler.java:34)
at 
org.gradle.internal.dispatch.FailureHandlingDispatch.dispatch(FailureHandlingDispatch.java:31)
at 
org.gradle.internal.dispatch.AsyncDispatch.dispatchMessages(AsyncDispatch.java:132)
at 
org.gradle.internal.dispatch.AsyncDispatch.access$000(AsyncDispatch.java:33)
at 
org.gradle.internal.dispatch.AsyncDispatch$1.run(AsyncDispatch.java:72)
at 
org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:54)
at 
org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:40)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.gradle.process.internal.ExecException: Process 'Gradle
Test Executor 4' finished with non-zero exit value 1


Test executors 6 and 7 also failed in this same way.


On Tue, Jan 17, 2017 at 3:41 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See 
>
> Changes:
>
> [john.sirois] Ensure Aurora thrift support js and html.
>
> --
> [...truncated 3475 lines...]
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/DbJobUpdateStore$2
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/DbCronJobStore
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/DbModule$1$1
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/DbModule$GarbageCollectorModule$1
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/InsertResult
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/DbStorage
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/DbModule
>   Test coverage missing for org/apache/aurora/scheduler/
> storage/db/DbSchedulerStore
>   Test coverage missing for org/apache/aurora/scheduler/storage/db/DbUtil
>   Test coverage missing for org/apache/aurora/scheduler/resources/
> MesosResourceConverter$RangeConverter
>   Test coverage missing for org/apache/aurora/scheduler/
> resources/ResourceBag
>   Test coverage missing for org/apache/aurora/scheduler/resources/
> AuroraResourceConverter
>   Test coverage missing for 

Re: Adding support for Mesos' Kill Policy when running docker executor-less tasks.

2017-01-05 Thread Erb, Stephan
I will try to summarize an off-list discussion so that more people can 
participate:

Aurora has an unofficial way to launch Docker containers without Thermos. 
Rather than using the Thermos executor, Mesos will directly call the container 
entrypoint. This support was contributed by Bill 
(https://reviews.apache.org/r/44685/). An additional patch by John 
(https://reviews.apache.org/r/44745/) to expose this functionality within the 
client job configuration was discarded due to a lack of consensus at the time. 
This means the entrypoint mode is only available to REST API users and to 
users with patched clients.

The goal of Nicolás is now to provide a graceful shutdown for containers 
running without Thermos. He has prepared a minimal patch that sketches the idea: 
https://github.com/apache/aurora/compare/master...medallia:KillPolicyGracePeriod


How do we want to proceed here? Do we plan to improve our Docker entrypoint 
story? If yes, can we just re-open John's RB and merge an extended version of 
Nicolás' change, or do we need some additional planning?

I am happy to hear what you think.


On 29/12/2016, 16:48, "Nicolas Donatucci"  wrote:

Hello everybody.

I was thinking of adding support for Mesos' current Grace Period Kill
Policy when running Docker containers without Thermos. It is currently the
only Kill Policy implemented by Mesos. (More information can be found here
https://github.com/apache/mesos/blob/master/CHANGELOG#L576-L585 and JIRA
issue here https://issues.apache.org/jira/browse/MESOS-4909)

My idea is to add a Kill Policy to TaskConfig in order to pass it on to
Mesos. The "finalization_wait" field of the task schema can be used to
create the corresponding Kill Policy.
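A minimal sketch of that mapping, assuming a dict-shaped task config and the Mesos KillPolicy layout (DurationInfo in nanoseconds); the helper name and shapes are illustrative, not Aurora's actual builder code:

```python
# Hypothetical helper: derive a Mesos KillPolicy from the task's
# finalization_wait (seconds). Returns None when no wait is configured,
# leaving Mesos to apply its default grace period.
def kill_policy_from_task(task_config):
    wait_secs = task_config.get('finalization_wait')
    if wait_secs is None:
        return None
    # Mesos DurationInfo expresses time in nanoseconds.
    return {'grace_period': {'nanoseconds': int(wait_secs * 1_000_000_000)}}
```

The appeal of the idea is that no new user-facing field is needed: the existing `finalization_wait` from the task schema doubles as the grace period passed to Mesos.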

What do you think?




Re: A mini postmortem on snapshot failures

2017-01-04 Thread Erb, Stephan
I am not aware of tickets or work on the _exact_ problem described in this 
thread.

However, there is currently some on-going work to improve scheduler performance 
and availability during snapshots on large clusters:

• https://reviews.apache.org/r/54883/  (still pending)
• https://reviews.apache.org/r/54774/ with associated ticket 
https://issues.apache.org/jira/browse/AURORA-1861 (already merged)

Maybe those are also interesting for you.


On 31/12/2016, 00:03, "meghdoot bhattacharya"  
wrote:

Is there a ticket for this? In our large test cluster, we ran into this 
under heavy load.
Thx

  From: John Sirois 
 To: dev@aurora.apache.org 
 Sent: Wednesday, October 5, 2016 11:04 AM
 Subject: Re: A mini postmortem on snapshot failures
   
On Tue, Oct 4, 2016 at 11:55 AM, Joshua Cohen  wrote:

> Hi Zameer,
>
> Thanks for this writeup!
>
> I think one other option to consider would be using a connection for
> writing the snapshots that's not bound by the pool's maximum checkout time.
> I'm not sure if this is feasible or not, but I worry that there's
> potentially no upper bound on raising the maximum checkout time as the size
> of a cluster grows or its read traffic grows. It feels a bit heavyweight
> to up the max checkout time when potentially the only connection exceeding
> the limits is the one writing the snapshot.
>

+1 to the above, it would be great if the snapshotter thread could grab its
own dedicated connection.  In fact, since the operation is relatively rare
(compared to all other reads and writes), it could even grab and dispose of
a new connection per-snapshot to simplify things (ie: no need to check for
a stale connection like you would have to if you grabbed one and tried to
hold it for the lifetime of the scheduler process).
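As a rough illustration of the grab-and-dispose pattern described above (sqlite3 stands in for the scheduler's actual H2/MyBatis connection pool; this is a sketch of the idea, not the real DbStorage code):

```python
import sqlite3

def write_snapshot(db_path, snapshot_blob):
    # Open a dedicated connection just for this snapshot instead of
    # borrowing one from the shared pool, so the pool's maximum checkout
    # time never applies to the long-running snapshot write.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute('CREATE TABLE IF NOT EXISTS snapshots (data BLOB)')
            conn.execute('INSERT INTO snapshots (data) VALUES (?)',
                         (snapshot_blob,))
    finally:
        conn.close()  # dispose per-snapshot; no stale-connection checks
```

Since snapshots are rare relative to normal reads and writes, the cost of opening a fresh connection each time is negligible.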


> I'd definitely be in favor of adding a flag to tune the maximum connection
> checkout.
>
> I'm neutral to negative on having Snapshot creation failures be fatal, I
> don't necessarily think that one failed snapshot should take the scheduler
> down, but I agree that some number of consecutive failures is a Bad Thing
> that is worthy of investigation. My concern with having failures be fatal
> is the pathological case where snapshots always fail, causing your scheduler
> to fail over once every SNAPSHOT_INTERVAL. Do you think it would be
> sufficient to add `scheduler_log_snapshots` to the list of important
> stats[1]?
>
> I'm also neutral on changing the defaults. I'm not sure if it's warranted,
> as the behavior will vary based on cluster. It seems like you guys got bit
> by this due to a comparatively heavy read load? Our cluster, on the other
> hand, is probably significantly larger, but is not queried as much, and we
> haven't run into issues with the defaults. However, as long as there is
> no adverse impact to bumping the default values I've got no objections.
>
> Cheers,
>
> Joshua
>
> [1]
> https://github.com/apache/aurora/blob/master/docs/
> operations/monitoring.md#important-stats
>
> On Fri, Sep 30, 2016 at 7:34 PM, Zameer Manji  wrote:
>
> > Aurora Developers and Users,
> >
> > I would like to share a failure case I experienced recently. In a modestly
> > sized production cluster with high read load, snapshot creation began to
> > fail. The logs showed the following:
> >
> > 
> > W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
> > ### Error rolling back transaction.  Cause: java.sql.SQLException: Error
> > accessing PooledConnection. Connection is invalid.
> > ### Cause: java.sql.SQLException: Error accessing PooledConnection.
> > Connection is invalid. org.apache.ibatis.exceptions.
> PersistenceException:
> > ### Error rolling back transaction.  Cause: java.sql.SQLException: Error
> > accessing PooledConnection. Connection is invalid.
> > ### Cause: java.sql.SQLException: Error accessing PooledConnection.
> > Connection is invalid.
> > at
> > org.apache.ibatis.exceptions.ExceptionFactory.wrapException(
> > ExceptionFactory.java:30)
> > at
> > org.apache.ibatis.session.defaults.DefaultSqlSession.
> > rollback(DefaultSqlSession.java:216)
> > at
> > org.apache.ibatis.session.SqlSessionManager.rollback(
> > SqlSessionManager.java:299)
> > at
> > org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(
> > TransactionalMethodInterceptor.java:116)
> > at
> > org.apache.aurora.scheduler.storage.db.DbStorage.lambda$
> > write$0(DbStorage.java:175)
> > at
> > org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(
> > GatingDelayExecutor.java:62)
> > at

Re: Build failed in Jenkins: aurora-packaging-nightly #535

2017-01-04 Thread Erb, Stephan
I tried to investigate the repeated build failures of the Aurora packaging 
build job but was not able to find the cause. Repeated builds on my local 
machine did not trigger the same error. 

If anyone else has an idea what could be wrong with 
https://builds.apache.org/view/All/job/aurora-packaging-nightly/, please step 
forward.



On 04/01/2017, 01:38, "Apache Jenkins Server"  wrote:

See 

--
[...truncated 18959 lines...]
failure
failure_limit
hello_world
ordering
ports
sleep60


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh:
org


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org:
apache


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache:
aurora


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache/aurora:
e2e


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache/aurora/e2e:
Dockerfile.netcat
Dockerfile.python
ephemeral_daemon_with_final.aurora
http
http_example.py
run-server.sh
test_bypass_leader_redirect_end_to_end.sh
test_daemonizing_process.aurora
test_end_to_end.sh
test_kerberos_end_to_end.sh
validate_serverset.py


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache/aurora/e2e/http:
http_example.aurora
http_example_bad_healthcheck.aurora
http_example_updated.aurora

artifacts/aurora-centos-7/dist/rpmbuild/BUILDROOT:

artifacts/aurora-centos-7/dist/rpmbuild/RPMS:
x86_64

artifacts/aurora-centos-7/dist/rpmbuild/RPMS/x86_64:
aurora-executor-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm

aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm

aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm

aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm

aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm

aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm

aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm

aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm
repodata

artifacts/aurora-centos-7/dist/rpmbuild/RPMS/x86_64/repodata:

077b3c5c12e5bc075694b4fc884b503e4167dcabba4502bdd72715cc1a464d7a-other.sqlite.bz2

162ac872fa2d3fde308a44e8617eb9a74a9649a0f4a09db1a8103033c4bf8dc9-primary.xml.gz

1bba22db0bdc1c51d165a9562d8c2a369ec0361b494e80ee51231e64283be84a-primary.sqlite.bz2

20ebec8dede455208f21767f897788be4cb2ff4465d63c62276d578e3497c337-other.xml.gz

22e71dace89c0e45d4fcb3afc6f805973d8b6a4e8400217f3e2d099cad2e4742-primary.xml.gz

247cc5d9118a98da934750f93ac1a03c39f9ee5e21ffb1f76cfea1ff6c01844b-filelists.xml.gz

262601e2b003888bdc9849b10e5aeac215981e64a4bedf3e07fb19d32a0e4f11-filelists.xml.gz

2c8f782d7e0a79c7971e4c0bcbe14260c298cfad7e84a963d598855b06cbace9-filelists.xml.gz


Proposal: Move snapshots into a separate log

2016-12-27 Thread Erb, Stephan
David has posted a great patch & design document on Review Board:

RB: https://reviews.apache.org/r/54883/
Design Doc: 
https://docs.google.com/document/d/1QVSEfeoCyt2D6cCmTCxy8-epufcuzIfnqRUkyT1betY/edit?usp=sharing

I am merely reposting his work here so that it won’t be missed by casual 
readers of the mailing list.

Best regards,
Stephan


Re: [DRAFT][REPORT]: Apache Aurora - December 2016

2016-12-07 Thread Erb, Stephan
Looks good to me. Thanks!

On 07/12/16 17:34, "Joshua Cohen"  wrote:

+1, thanks for doing this Jake!

On Wed, Dec 7, 2016 at 9:55 AM, Jake Farrell  wrote:

>  Below is our draft board report which is due next week. Please let me 
know
> if you see any additions or changes that should be made, I'll plan on
> submitting this Friday if there are no objections
>
> -Jake
>
>
>
> Apache Aurora is a stateless and fault tolerant service scheduler used to
> schedule jobs onto Apache Mesos such as long-running services, cron jobs,
> and one off tasks.
>
> Project Status
> -
> The Apache Aurora community has continued to see growth from new users and
> contributors over the last quarter. There has been a great deal of 
activity
> in open
> code reviews with excellent community involvement. In September we 
released
> Apache Aurora 0.16.0 and since have started discussions about our next
>  0.17.0
> release candidate. The next release will focus on a number of bug
> fixes, stability
> enhancements as well as some new features outlined in our release notes 
[1]
>
> Community
> ---
> Latest Additions:
>
> * Committer addition: Stephan Erb, 2.3.2016
> * PMC addition:  Stephan Erb, 2.3.2016
>
> Issue backlog status since last report:
>
> * Created:   86
> * Resolved: 57
>
> Mailing list activity since last report:
>
> * @dev       169 messages
> * @user       37 messages
> * @reviews  1391 messages
>
> Releases
> ---
> Last release: Apache Aurora 0.16.0 released 09.28.2016
>
>
> [1]: https://github.com/apache/aurora/blob/master/RELEASE-NOTES.md
>




Re: Preparations for 0.17.0

2016-11-21 Thread Erb, Stephan
In the last two weeks we have added several tickets to the 0.17 milestone 
without making much progress on shortening the list. In addition, we have 
several reviews on RB going stale due to lack of progress by both reviewers and 
contributors (myself included). 

We should try not to lose focus here. Many features & fixes in a release are 
great. More regular releases are even better :-)


On 03/11/16 23:01, "Renan DelValle" <rdelv...@binghamton.edu> wrote:

I'd really like to take care of
https://issues.apache.org/jira/browse/AURORA-1780 for 0.17.0, at least
allow the scheduler to take less drastic measures.

If any one wants to submit any feedback as to how we should tackle this,
I'm all ears.

-Renan

On Thu, Nov 3, 2016 at 4:18 PM, Joshua Cohen <jco...@apache.org> wrote:

> I added https://issues.apache.org/jira/browse/AURORA-1782 to 0.17.0
> Hopefully I'll have time to look into that soon.
>
> On Wed, Nov 2, 2016 at 4:53 PM, Zameer Manji <zma...@apache.org> wrote:
>
> > Thanks for stepping up!
> >
> > I think picking up Mesos 1.1 is ideal for the release and we should 
block
> > our release until it is out.
> >
> > On Wed, Nov 2, 2016 at 9:24 AM, Erb, Stephan <
> stephan@blue-yonder.com>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I’d like to volunteer as our next release manager and set the release
> > > train for 0.17 into motion.
> > >
> > > Since 0.16 we have fixed several important bugs and should therefore
> aim
> > > for a release in the next 2-4 weeks, if possible. If everything goes
> > > according to plan for the current Mesos release, we should be able to
> > pick
> > > up Mesos 1.1 as well.
> > >
> > > Please tag any blocker issues with `fixVersion 0.17` so that they show
> up
> > > on this dashboard: https://issues.apache.org/
> > jira/browse/AURORA-1014?jql=
> > > project%20%3D%20AURORA%20AND%20fixVersion%20%3D%200.17.0%
> > > 20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%
> > > 20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
> > >
> > > Best regards,
> > > Stephan
> > >
> > > --
> > > Zameer Manji
> > >
> >
>




Preparations for 0.17.0

2016-11-02 Thread Erb, Stephan
Hi everyone,

I’d like to volunteer as our next release manager and set the release train for 
0.17 into motion.

Since 0.16 we have fixed several important bugs and should therefore aim for a 
release in the next 2-4 weeks, if possible. If everything goes according to 
plan for the current Mesos release, we should be able to pick up Mesos 1.1 as 
well.

Please tag any blocker issues with `fixVersion 0.17` so that they show up on 
this dashboard: 
https://issues.apache.org/jira/browse/AURORA-1014?jql=project%20%3D%20AURORA%20AND%20fixVersion%20%3D%200.17.0%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
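Decoded with nothing but the standard library (the query string is copied verbatim from the URL above), the JQL behind that dashboard reads:

```python
from urllib.parse import unquote

jql = unquote(
    "project%20%3D%20AURORA%20AND%20fixVersion%20%3D%200.17.0"
    "%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY"
    "%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC"
)
print(jql)
# project = AURORA AND fixVersion = 0.17.0 AND resolution = Unresolved ORDER BY due ASC, priority DESC, created ASC
```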

Best regards,
Stephan


AttributeAggregate performance

2016-10-25 Thread Erb, Stephan
Hi everyone,

I believe I might have found a small performance regression in our scheduling 
code:

https://issues.apache.org/jira/browse/AURORA-1802

I don’t have the cycles (or the necessity) to look into this any further at 
this point in time. However, for some of you with huge clusters this may be of 
interest. This is probably a rather isolated change and therefore also suitable 
for newer contributors.

Best regards,
Stephan


Re: [VOTE] Release Apache Aurora 0.16.0 packages

2016-10-24 Thread Erb, Stephan
+1 (binding)

Also verified via the test instructions.

On 24/10/16 21:41, "John Sirois"  wrote:

+1 (binding)

Tested using the 3 vagrant test setups under aurora-packaging/test.
There were the `libcurl4-nss-dev` missing dep issues in each of the deb vms
that I worked around with `sudo apt-get -f install`, but these issues were
merely test setup issues and not issues with the aurora packages which
installed and worked flawlessly.

On Thu, Oct 20, 2016 at 5:39 PM, Joshua Cohen  wrote:

> The git url does? This one works for me:
> https://git1-us-west.apache.org/repos/asf?p=aurora-
> packaging.git;a=tree;hb=refs/heads/0.16.x
>
> $ curl -I
> https://git1-us-west.apache.org/repos/asf?p=aurora-
> packaging.git;a=tree;hb=refs/heads/0.16.x
> HTTP/1.1 200 OK
> Date: Thu, 20 Oct 2016 21:38:26 GMT
> Server: Apache/2.4.7 (Ubuntu)
> Vary: Accept-Encoding
> Content-Type: text/html; charset=utf-8
>
>
> On Thu, Oct 20, 2016 at 4:29 PM, Henry Saputra 
> wrote:
>
> > Seems like it returns 404?
> >
> > On Thu, Oct 20, 2016 at 10:38 AM, Joshua Cohen 
> wrote:
> >
> > > Also, apparently gmail didn't update the underlying urls for the git
> > links
> > > when I c/p'd the email from the 0.15.0 vote. This is the actual branch
> is
> > > here: https://git1-us-west.apache.org/repos/asf?p=aurora-packaging
> > > .git;a=tree;hb=refs/heads/0.16.x
> > >  > > packaging.git;a=tree;hb=refs/heads/0.15.x>
> > >
> > > On Thu, Oct 20, 2016 at 10:34 AM, Joshua Cohen 
> > wrote:
> > >
> > > > I'll start the voting off with my own +1. I verified packages for 
all
> > > > three platforms against the test vagrant images from the packaging
> > repo.
> > > >
> > > > On Thu, Oct 20, 2016 at 10:33 AM, Joshua Cohen 
> > > wrote:
> > > >
> > > >> All,
> > > >>
> > > >>
> > > >> I propose that we accept the following artifacts as the official 
deb
> > > >> and rpm packaging for Apache Aurora 0.16.0:
> > > >>
> > > >>
> > > >> *https://dl.bintray.com/jcohen/aurora/
> > > >> *
> > > >>
> > > >>
> > > >> The Aurora deb and rpm packaging includes the following:
> > > >>
> > > >> ---
> > > >>
> > > >> The branch used to create the packaging is:
> > > >>
> > > >> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging
> > > >> .git;a=tree;hb=refs/heads/0.16.x
> > > >>  > > packaging.git;a=tree;hb=refs/heads/0.15.x>
> > > >>
> > > >>
> > > >> The packages are available at:
> > > >>
> > > >> *https://dl.bintray.com/jcohen/aurora/
> > > >> *
> > > >>
> > > >>
> > > >> The GPG keys used to sign the packages are available at:
> > > >>
> > > >> https://dist.apache.org/repos/dist/release/aurora/KEYS
> > > >>
> > > >>
> > > >> Please download, verify, and test. Detailed test instructions are
> > > >> available here
> > > >>
> > > >> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging
> > > >> .git;a=tree;f=test;hb=refs/heads/0.16.x
> > > >>  > > packaging.git;a=tree;f=test;hb=refs/heads/0.15.x>
> > > >>
> > > >>
> > > >> The vote will close on Tue Oct  25 10:30:00 CDT 2016
> > > >>
> > > >>
> > > >>
> > > >> [ ] +1 Release these as the deb and rpm packages for Apache Aurora
> > > 0.15.0
> > > >>
> > > >> [ ] +0
> > > >>
> > > >> [ ] -1 Do not release these artifacts because...
> > > >>
> > > >
> > > >
> > >
> >
>




Re: Build failed in Jenkins: aurora-packaging-nightly #452

2016-10-12 Thread Erb, Stephan
All those recent packaging failures are due to the gradle major version 
upgrade. Fix is here: https://reviews.apache.org/r/52777/

On 12/10/16 02:31, "Apache Jenkins Server"  wrote:

See 

Changes:

[serb] Update mybatis, h2, and jmh to their latest versions.

--
[...truncated 97 lines...]
Starting a Gradle Daemon (subsequent builds will be faster)
> Starting Daemon > Connecting to Daemon
[Gradle buildSrc script compilation and gradle-api JAR generation progress output elided]
:buildSrc:clean UP-TO-DATE
:buildSrc:compileJava UP-TO-DATE
> Loading > buildSrc > :buildSrc:compileGroovy:buildSrc:compileGroovy
> Loading > buildSrc:buildSrc:processResources UP-TO-DATE
:buildSrc:classes
> Loading > buildSrc > :buildSrc:jar:buildSrc:jar
:buildSrc:assemble
:buildSrc:compileTestJava UP-TO-DATE
:buildSrc:compileTestGroovy UP-TO-DATE
:buildSrc:processTestResources UP-TO-DATE
:buildSrc:testClasses 

Re: A mini postmortem on snapshot failures

2016-10-04 Thread Erb, Stephan
Thanks for the pointers regarding the broken documentation. I will fix that.

The configuration options have moved and are now described here 
http://aurora.apache.org/documentation/latest/operations/configuration/#replicated-log-configuration


On 03/10/16 09:05, "meghdoot bhattacharya"  wrote:

Zameer, thanks for sharing this. For folks who are looking to operate 
Aurora with HA this is very valuable. Operational insights from Aurora experts 
are always welcome. Not to hijack the conversation on the 3 questions you asked, 
but in http://aurora.apache.org/documentation/latest/operations/storage/

I found the links to "Replicated log configuration" and "Backup configuration" 
broken. The last reference I can find is in the 0.12 release: 
http://aurora.apache.org/documentation/0.12.0/storage-config/

If the operational documents can be reviewed and enhanced further, that 
would be helpful. Highlighting stats like 'scheduler_log_snapshots' or the load 
time, for example, as alert points would help operators.
Thx
  From: Zameer Manji 
 To: dev@aurora.apache.org; u...@aurora.apache.org 
 Sent: Friday, September 30, 2016 5:34 PM
 Subject: A mini postmortem on snapshot failures
   
Aurora Developers and Users,

I would like to share a failure case I experienced recently. In a modestly
sized production cluster with high read load, snapshot creation began to
fail. The logs showed the following:


W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
### Error rolling back transaction.  Cause: java.sql.SQLException: Error
accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid. org.apache.ibatis.exceptions.PersistenceException:
### Error rolling back transaction.  Cause: java.sql.SQLException: Error
accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid.
at

org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
at

org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:216)
at

org.apache.ibatis.session.SqlSessionManager.rollback(SqlSessionManager.java:299)
at

org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:116)
at

org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$0(DbStorage.java:175)
at

org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
at
org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:173)
at

org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
at

org.apache.aurora.scheduler.storage.log.LogStorage.doInTransaction(LogStorage.java:521)
at

org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:551)
at

org.apache.aurora.scheduler.storage.log.LogStorage.doSnapshot(LogStorage.java:489)
at

org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
at

org.apache.aurora.scheduler.storage.log.LogStorage.snapshot(LogStorage.java:565)
at

org.apache.aurora.scheduler.storage.log.LogStorage.lambda$scheduleSnapshots$20(LogStorage.java:468)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid.
at

org.apache.ibatis.datasource.pooled.PooledConnection.checkConnection(PooledConnection.java:254)
at

org.apache.ibatis.datasource.pooled.PooledConnection.invoke(PooledConnection.java:243)
at com.sun.proxy.$Proxy135.getAutoCommit(Unknown Source)
at

org.apache.ibatis.transaction.jdbc.JdbcTransaction.rollback(JdbcTransaction.java:79)
at org.apache.ibatis.executor.BaseExecutor.rollback(BaseExecutor.java:249)
at

org.apache.ibatis.executor.CachingExecutor.rollback(CachingExecutor.java:119)
at

org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:213)
... 19 common frames omitted


This failure is silent and can be observed only through the
`scheduler_log_snapshots` metric, if it isn't 

Re: A mini postmortem on snapshot failures

2016-10-04 Thread Erb, Stephan
An immediate failover seems rather drastic to me. However, I have no anecdotal 
evidence to back up this feeling, nor any to motivate other default config 
changes. Maybe Joshua can share what they are using so that we can adjust the 
default values accordingly?

Other thoughts:
• Have you tried this magic trick? 
https://issues.apache.org/jira/browse/AURORA-1211?focusedCommentId=14483533=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14483533
• A counter for failed scheduler backups sounds like a good idea. We could also 
mention this in our monitoring documentation.

Best regards,
Stephan

On 01/10/16 02:34, "Zameer Manji"  wrote:

Aurora Developers and Users,

I would like to share a failure case I experienced recently. In a modestly
sized production cluster with high read load, snapshot creation began to
fail. The logs showed the following:


W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
### Error rolling back transaction.  Cause: java.sql.SQLException: Error
accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid. org.apache.ibatis.exceptions.PersistenceException:
### Error rolling back transaction.  Cause: java.sql.SQLException: Error
accessing PooledConnection. Connection is invalid.
### Cause: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid.
at

org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
at

org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:216)
at

org.apache.ibatis.session.SqlSessionManager.rollback(SqlSessionManager.java:299)
at

org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:116)
at

org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$0(DbStorage.java:175)
at

org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
at
org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:173)
at

org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
at

org.apache.aurora.scheduler.storage.log.LogStorage.doInTransaction(LogStorage.java:521)
at

org.apache.aurora.scheduler.storage.log.LogStorage.write(LogStorage.java:551)
at

org.apache.aurora.scheduler.storage.log.LogStorage.doSnapshot(LogStorage.java:489)
at

org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
at

org.apache.aurora.scheduler.storage.log.LogStorage.snapshot(LogStorage.java:565)
at

org.apache.aurora.scheduler.storage.log.LogStorage.lambda$scheduleSnapshots$20(LogStorage.java:468)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Error accessing PooledConnection.
Connection is invalid.
at

org.apache.ibatis.datasource.pooled.PooledConnection.checkConnection(PooledConnection.java:254)
at

org.apache.ibatis.datasource.pooled.PooledConnection.invoke(PooledConnection.java:243)
at com.sun.proxy.$Proxy135.getAutoCommit(Unknown Source)
at

org.apache.ibatis.transaction.jdbc.JdbcTransaction.rollback(JdbcTransaction.java:79)
at org.apache.ibatis.executor.BaseExecutor.rollback(BaseExecutor.java:249)
at

org.apache.ibatis.executor.CachingExecutor.rollback(CachingExecutor.java:119)
at

org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:213)
... 19 common frames omitted


This failure is silent and can be observed only through the
`scheduler_log_snapshots` metric, if it isn't increasing then snapshots are
not being created. In this cluster, a snapshot was not taken for about 4
days.
For those unfamiliar with Aurora's replicated log storage system, snapshot
creation is important because it allows us to truncate the number of
entries in the replicated log to a single large entry. This is required
because the log recovery time is proportional to the number of entries in
the log. Operators can observe the amount of time it takes to recover the
log at startup via the `scheduler_log_recover_nanos_total` metric.
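A sketch of how an operator might watch for this silent failure, assuming the scheduler exposes its counters as one `name value` pair per line (the /vars text format here is an assumption; adapt the parsing and polling to your monitoring stack):

```python
def parse_vars(text):
    """Parse plain-text stats output ('name value' per line) into a dict
    of integer counters; lines with non-numeric values are skipped."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(' ')
        value = value.strip()
        if name and value.lstrip('-').isdigit():
            stats[name] = int(value)
    return stats

def snapshots_stalled(previous, current, counter='scheduler_log_snapshots'):
    """Alert condition: the snapshot counter failed to increase between two
    polls taken at least one SNAPSHOT_INTERVAL apart."""
    return current.get(counter, 0) <= previous.get(counter, 0)
```

Polling this counter once per snapshot interval and alerting when it stalls would have caught the four-day gap described above.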

The largest value 

Re: [VOTE] Release Apache Aurora 0.16.0 RC1

2016-09-21 Thread Erb, Stephan
Unfortunate -1 from me. I bumped into this: 
https://issues.apache.org/jira/browse/AURORA-1779 


On 20/09/16 22:37, "Joshua Cohen"  wrote:

I'll start with my own +1 vote.

Verified with the verify-release-candidate script.

On Tue, Sep 20, 2016 at 3:01 PM, Joshua Cohen  wrote:

> All,
>
> I propose that we accept the following release candidate as the official
> Apache Aurora 0.16.0 release.
>
> Aurora 0.16.0-rc1 includes the following:
> ---
> The RELEASE NOTES for the release are available at:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> RELEASE-NOTES.md=rel/0.16.0-rc1
>
> The CHANGELOG for the release is available at:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=
> CHANGELOG=rel/0.16.0-rc1
>
> The tag used to create the release candidate is:
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=
> shortlog;h=refs/tags/rel/0.16.0-rc1
>
> The release candidate is available at:
> https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc1/
> apache-aurora-0.16.0-rc1.tar.gz
>
> The MD5 checksum of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc1/
> apache-aurora-0.16.0-rc1.tar.gz.md5
>
> The signature of the release candidate can be found at:
> https://dist.apache.org/repos/dist/dev/aurora/0.16.0-rc1/
> apache-aurora-0.16.0-rc1.tar.gz.asc
>
> The GPG key used to sign the release are available at:
> https://dist.apache.org/repos/dist/dev/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Fri Sep 23 15:00:57 CDT 2016
>
> [ ] +1 Release this as Apache Aurora 0.16.0
> [ ] +0
> [ ] -1 Do not release this as Apache Aurora 0.16.0 because...
>
> 
> 
>





Re: Aurora 0.16.0 release

2016-09-08 Thread Erb, Stephan
I would like to get https://reviews.apache.org/r/51664/ into the release. I am 
open for feedback and will have time to update the review request on early 
Friday.

Thanks,
Stephan


On 06/09/16 17:36, "Joshua Cohen"  wrote:

Hi Aurorans,

I plan to kick off the 0.16.0 release some time later this week. Please let
me know if there are any outstanding patches you'd like to ship before this
release.

Thanks!

Joshua





Re: Build failed in Jenkins: aurora-packaging-nightly #408

2016-08-29 Thread Erb, Stephan
Filed an issue with pants: https://github.com/pantsbuild/pants/issues/3817 


On 29/08/16 02:34, "Apache Jenkins Server"  wrote:

See 

--
[...truncated 1404 lines...]
Collecting twitter.common.confluence<0.4,>=0.3.1 (from 
pantsbuild.pants==1.1.0-rc7)
  Downloading twitter.common.confluence-0.3.7.tar.gz
Collecting fasteners==0.14.1 (from pantsbuild.pants==1.1.0-rc7)
  Downloading fasteners-0.14.1-py2.py3-none-any.whl
Collecting coverage<3.8,>=3.7 (from pantsbuild.pants==1.1.0-rc7)
  Downloading coverage-3.7.1.tar.gz (284kB)


Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)

2016-08-24 Thread Erb, Stephan
The curator backend has been working well for us so far. I believe it is safe 
to make it the default for the next release, and to drop the old code in the 
release after that.

From: John Sirois 
Reply-To: "u...@aurora.apache.org" , 
"jsir...@apache.org" 
Date: Thursday 7 July 2016 at 01:13
To: Martin Hrabovčin 
Cc: "dev@aurora.apache.org" , Jake Farrell 
, "u...@aurora.apache.org" 
Subject: Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator 
(`-zk_use_curator`)

Now that 0.15.0 has been released, I thought I'd check in on any progress folks 
have made with testing/deploying the 0.14.0+ with the Aurora Scheduler 
`-zk_use_curator` flag in-place.
There has been 1 fix that will go out in the 0.16.0 release to reduce logger 
noise on shutdown [1][2] but I have heard no negative (or positive) feedback 
otherwise.

[1] https://issues.apache.org/jira/browse/AURORA-1729
[2] https://reviews.apache.org/r/49578/

On Thu, Jun 16, 2016 at 1:18 PM, John Sirois 
> wrote:


On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin 
> wrote:
How should this flag be rolled out to an existing running cluster? Can it be done 
as a rolling update, instance by instance, or do we need to stop the whole cluster 
and then bring all nodes up with the new flag?

I recommend taking the whole cluster down, upgrading with the new flag, and bringing it back up.

A rolling update should work, but will likely be rocky.  My analysis:

The Aurora leader election consists of 2 components, the actual leader election 
and the resulting advertisement by the leader of itself as the Aurora service 
endpoint.  These 2 components each use zookeeper and of the 2 I only ensured 
that the advertisement was compatible with old releases (old clients). The 
leader election portion is completely internal to the Aurora scheduler 
instances vying for leadership and, under Curator, uses a different (enhanced), 
zookeeper node scheme.  As a result, this is what could happen in a slow roll:

before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
upgrade 0: new-lead, 1: old-lead, 2: old-follow

Here, node 0 will see itself as leader, while nodes 1 and 2 will see node 1 as 
leader. The result is both node 0 and node 1 attempting to read the mesos 
distributed log. The log uses its own leader election, and as things stand the 
reader must be that leader, so the Aurora-level leadership "tie" will be 
broken by one of the two Aurora-level leaders failing to become the mesos 
distributed log leader; that node will restart its lifecycle, i.e. flap.  
This will continue to be the case through the second node upgrade and will not 
stabilize until the 3rd node is upgraded.


2016-06-16 5:03 GMT+02:00 Jake Farrell 
>:
+1, will enable on our test clusters to help verify

-Jake

On Tue, Jun 14, 2016 at 7:43 PM, John Sirois 
> wrote:

> I'd like to move forward with
> https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
> legacy
> (Twitter) commons zookeeper libraries used for Aurora leader election in
> favor of Apache Curator libraries. The change submitted in
> https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
> Apache
> Curator based service discovery can be enabled with the Aurora scheduler
> flag `-zk_use_curator`.  I'd like feedback from users who enable this
> option.  If you have a test cluster where you can enable `-zk_use_curator`
> and exercise leader failure and failover, I'd be grateful. If you have
> moved to using this option in production with demonstrable improvements or
> even maintenance of status quo, I'd also be grateful for this news. If
> you've found regressions or new bugs, I'd love to know about those as well.
>
> Thanks in advance to all those who find time to test this out on real
> systems!
>





Re: [VOTE] Release Apache Aurora 0.15.0 packages

2016-08-02 Thread Erb, Stephan
This vote is still open and the packages have not been released yet. Formally, 
everything looks correct to me. We should be able to proceed here, right? 

On 12/07/16 22:54, "Erb, Stephan" <stephan@blue-yonder.com> wrote:

+1

Ran the test instructions for all three architectures. There is a minor hiccup 
with the provision.sh for Debian, but nothing too serious 
https://gist.github.com/StephanErb/f96f2e4038499efba4fafebada58a9a7.

On 08/07/16 23:59, "John Sirois" <john.sir...@gmail.com> wrote:

Ah yes - and that's covered in the release notes.  Sorry for the thrash.
+1

I was hurrying a bit to get this tested before going AFK and the lack of
packaging fixes for 0.15 on aurora-packaging master threw me.  Those fixes
should make it to master IIUC.

On Thu, Jul 7, 2016 at 11:53 AM, Maxim Khutornenko <ma...@apache.org> wrote:

> I have replied in https://reviews.apache.org/r/49732/. The main
> purpose of 0.15.0 was to upgrade to Mesos 0.28.2 and as such we must
> build against that version.
>
> On Wed, Jul 6, 2016 at 7:47 PM, John Sirois <john.sir...@gmail.com> wrote:
> > On Jul 6, 2016 8:34 PM, "John Sirois" <john.sir...@gmail.com> wrote:
> >>
> >> I missed Maxim's ~same fix here: https://reviews.apache.org/r/49732/
> >>
> >> Folks will need that patch to test, which is not on master, but on the
> > 0.15.x branch of aurora-packaging.
> >>
> >> On Wed, Jul 6, 2016 at 8:22 PM, John Sirois <john.sir...@gmail.com>
> wrote:
> >>>
> >>> +1 ... with minor nagging reservations.
> >
> > I updated my understanding on https://reviews.apache.org/r/49732/ which
> > changes my vote to -1. The >=0.28.2 mesos constraint does not seem to be
> > correct / in line with our standard compatibility claims.
> >
> >>>
> >>> I could only get the packages tested out after applying the
> provision.sh
> > fixes here: https://reviews.apache.org/r/49740/
> >>> I think the 0.15.0 release is supposed to be compatible back to mesos
> > 0.27(.2) though; thus the reservations.
> >>> I need to dig a bit more on the test failure (aurora-scheduler failed
> to
> > start on debian-jessie - did not try other platforms
> >>> before the linked RB fix), but I was wondering what other folks found
> > without the above linked RB patch.
> >>>
> >>> On Wed, Jul 6, 2016 at 6:40 PM, Maxim Khutornenko <ma...@apache.org>
> > wrote:
> >>>>
> >>>> All,
> >>>>
> >>>>
> >>>> I propose that we accept the following artifacts as the official deb
> >>>> and rpm packaging for Apache Aurora 0.15.0:
> >>>>
> >>>>
> >>>> https://dl.bintray.com/mkhutornenko/aurora/
> >>>>
> >>>>
> >>>> The Aurora deb and rpm packaging includes the following:
> >>>>
> >>>> ---
> >>>>
> >>>> The branch used to create the packaging is:
> >>>>
> >>>>
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;hb=refs/heads/0.15.x
> >>>>
> >>>>
> >>>> The packages are available at:
> >>>>
> >>>> https://dl.bintray.com/mkhutornenko/aurora/
> >>>>
> >>>>
> >>>> The GPG keys used to sign the packages are available at:
> >>>>
> >>>> https://dist.apache.org/repos/dist/release/aurora/KEYS
> >>>>
> >>>>
> >>>> Please download, verify, and test. Detailed test instructions are
> > available here
> >>>>
> >>>>
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;f=test;hb=refs/heads/0.15.x
> >>>>
> >>>>
> >>>> The vote will close on Mon Jul  11 18:55:16 PDT 2016
> >>>>
> >>>>
> >>>>
> >>>> [ ] +1 Release these as the deb and rpm packages for Apache Aurora
> > 0.15.0
> >>>>
> >>>> [ ] +0
> >>>>
> >>>> [ ] -1 Do not release these artifacts because...
> >>>>
> >>>>
> >>>> I would like to get the voting started off with my own +1
> >>>
> >>>
> >>
>






Re: Dynamic Reservation Meeting

2016-08-02 Thread Erb, Stephan
Making your meeting public is a great service to the community. Thanks for that!

I wish all of you a productive meeting :-) 

On 02/08/16 01:08, "Mehrdad Nurolahzade"  wrote:

Hi Devs,

Folks here at Twitter (Maxim and myself) will be meeting with Uber folks
(Dmitriy and Zameer) over lunch to discuss *Dynamic Reservation*. This
meeting will be open to the Apache Aurora Community via Google Hangouts.

Please join us if you want to share your take on this feature, are curious to
learn more, or are eager to contribute.

*Time:* Tuesday Aug 2, 2016 12:00pm PST
*Google Hangouts link:*
https://plus.google.com/hangouts/_/twitter.com/apache-aurora?authuser=0=bW51cm9sYWh6YWRlQHR3aXR0ZXIuY29t.js5aj1kn1ecgac5t3fqsjsv0nc
*Dynamic Reservation RFC:*
https://docs.google.com/document/d/15n29HSQPXuFrnxZAgfVINTRP1Iv47_jfcstJNuMwr5A/edit#

Cheers,
Mehrdad




Re: Supporting the mesos HTTP executor API

2016-07-13 Thread Erb, Stephan
The HTTP executor API sounds interesting and I would like to see it land in 
Aurora one day.

As of today, I see two major hurdles with the libmesos dependency of the 
executor:

a) Users have to ship Mesos dependencies within their Docker images
b) We need to build and upload the libmesos eggs whenever we want to upgrade to 
a new Mesos version

Fortunately, even without the HTTP API, issue a) will be addressed by the 
upcoming universal containerizer support in Aurora. The executor will be 
running within the host filesystem rather than within the container filesystem. 

On 11/07/16 16:05, "Steve Niemitz"  wrote:

Has anyone begun thinking about implementing support for the HTTP executor
API in mesos?  Now that we require 0.28.0+, the API is supported, although
"experimental" until 1.0.

Dropping the executor's requirement for libmesos would be amazing.




Re: [VOTE] Release Apache Aurora 0.15.0 packages

2016-07-12 Thread Erb, Stephan
+1

Ran the test instructions for all three architectures. There is a minor hiccup 
with the provision.sh for Debian, but nothing too serious 
https://gist.github.com/StephanErb/f96f2e4038499efba4fafebada58a9a7.

On 08/07/16 23:59, "John Sirois"  wrote:

Ah yes - and that's covered in the release notes.  Sorry for the thrash.
+1

I was hurrying a bit to get this tested before going AFK and the lack of
packaging fixes for 0.15 on aurora-packaging master threw me.  Those fixes
should make it to master IIUC.

On Thu, Jul 7, 2016 at 11:53 AM, Maxim Khutornenko  wrote:

> I have replied in https://reviews.apache.org/r/49732/. The main
> purpose of 0.15.0 was to upgrade to Mesos 0.28.2 and as such we must
> build against that version.
>
> On Wed, Jul 6, 2016 at 7:47 PM, John Sirois  wrote:
> > On Jul 6, 2016 8:34 PM, "John Sirois"  wrote:
> >>
> >> I missed Maxim's ~same fix here: https://reviews.apache.org/r/49732/
> >>
> >> Folks will need that patch to test, which is not on master, but on the
> > 0.15.x branch of aurora-packaging.
> >>
> >> On Wed, Jul 6, 2016 at 8:22 PM, John Sirois 
> wrote:
> >>>
> >>> +1 ... with minor nagging reservations.
> >
> > I updated my understanding on https://reviews.apache.org/r/49732/ which
> > changes my vote to -1. The >=0.28.2 mesos constraint does not seem to be
> > correct / in line with our standard compatibility claims.
> >
> >>>
> >>> I could only get the packages tested out after applying the
> provision.sh
> > fixes here: https://reviews.apache.org/r/49740/
> >>> I think the 0.15.0 release is supposed to be compatible back to mesos
> > 0.27(.2) though; thus the reservations.
> >>> I need to dig a bit more on the test failure (aurora-scheduler failed
> to
> > start on debian-jessie - did not try other platforms
> >>> before the linked RB fix), but I was wondering what other folks found
> > without the above linked RB patch.
> >>>
> >>> On Wed, Jul 6, 2016 at 6:40 PM, Maxim Khutornenko 
> > wrote:
> 
>  All,
> 
> 
>  I propose that we accept the following artifacts as the official deb
>  and rpm packaging for Apache Aurora 0.15.0:
> 
> 
>  https://dl.bintray.com/mkhutornenko/aurora/
> 
> 
>  The Aurora deb and rpm packaging includes the following:
> 
>  ---
> 
>  The branch used to create the packaging is:
> 
> 
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;hb=refs/heads/0.15.x
> 
> 
>  The packages are available at:
> 
>  https://dl.bintray.com/mkhutornenko/aurora/
> 
> 
>  The GPG keys used to sign the packages are available at:
> 
>  https://dist.apache.org/repos/dist/release/aurora/KEYS
> 
> 
>  Please download, verify, and test. Detailed test instructions are
> > available here
> 
> 
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;f=test;hb=refs/heads/0.15.x
> 
> 
>  The vote will close on Mon Jul  11 18:55:16 PDT 2016
> 
> 
> 
>  [ ] +1 Release these as the deb and rpm packages for Apache Aurora
> > 0.15.0
> 
>  [ ] +0
> 
>  [ ] -1 Do not release these artifacts because...
> 
> 
>  I would like to get the voting started off with my own +1
> >>>
> >>>
> >>
>




Multi-Role Frameworks

2016-07-10 Thread Erb, Stephan
Some people started an initiative to support frameworks with multiple Mesos 
roles:


· Epic: https://issues.apache.org/jira/browse/MESOS-1763

· Design doc: https://docs.google.com/document/d/1gnDdADUMhPgvPa_lN7y97riUEhTCxmrLinOZM471HLc/edit?pref=2=1#heading=h.czfhotx8xvs5

I believe we should have a close eye on this. It could potentially enable us to 
establish a 1-1 mapping from Aurora roles to Mesos roles and thus allow us to 
handle certain features like quotas or per-role reservations in Mesos, rather 
than having to re-implement them in Aurora.

I haven’t looked into the details yet, but it seemed worth sharing.
Best Regards,
Stephan


Re: Build failed in Jenkins: Aurora #1548

2016-07-04 Thread Erb, Stephan
Looks like src.test.python.apache.thermos.core.test_process.test_log_tee is 
flaky. I have seen it fail at least twice.


On 04/07/16 16:46, "Apache Jenkins Server"  wrote:

See 

Changes:

[john.sirois] Close `PathChildrenCache` before its framework.

--
[...truncated 3553 lines...]
 
src/test/python/apache/thermos/observer/test_observer_task_detector.py::test_observer_task_detector_standard_transitions PASSED
src/test/python/apache/thermos/observer/test_observer_task_detector.py::test_observer_task_detector_nonstandard_transitions PASSED
src/test/python/apache/aurora/client/api/test_updater_util.py::TestRangeConversion::test_empty_list PASSED
src/test/python/apache/aurora/client/api/test_updater_util.py::TestRangeConversion::test_multiple_ranges PASSED
src/test/python/apache/aurora/client/api/test_updater_util.py::TestRangeConversion::test_none_list PASSED
src/test/python/apache/aurora/client/api/test_updater_util.py::TestRangeConversion::test_one_element PASSED
src/test/python/apache/aurora/client/api/test_updater_util.py::TestRangeConversion::test_pulse_interval_secs PASSED
src/test/python/apache/aurora/client/api/test_updater_util.py::TestRangeConversion::test_pulse_interval_too_low PASSED
src/test/python/apache/aurora/client/api/test_updater_util.py::TestRangeConversion::test_pulse_interval_unset PASSED
src/test/python/apache/aurora/client/api/test_job_monitor.py::JobMonitorTest::test_empty_job_succeeds PASSED
src/test/python/apache/aurora/client/api/test_job_monitor.py::JobMonitorTest::test_terminate PASSED
src/test/python/apache/aurora/client/api/test_job_monitor.py::JobMonitorTest::test_terminated_exits_immediately PASSED
src/test/python/apache/aurora/client/api/test_job_monitor.py::JobMonitorTest::test_wait_until_state PASSED
src/test/python/apache/aurora/client/api/test_job_monitor.py::JobMonitorTest::test_wait_until_timeout PASSED
src/test/python/apache/aurora/client/api/test_job_monitor.py::JobMonitorTest::test_wait_with_instances PASSED
src/test/python/apache/aurora/client/api/test_health_check.py::HealthCheckTest::test_changed_task_id PASSED
src/test/python/apache/aurora/client/api/test_health_check.py::HealthCheckTest::test_failed_status_health_check PASSED
src/test/python/apache/aurora/client/api/test_health_check.py::HealthCheckTest::test_simple_status_health_check PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_add_instances PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_get_job_update_details PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_get_job_update_diff PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_pause_job_update PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_query_job_updates PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_query_job_updates_no_filter PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_resume_job_update PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_set_quota PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_start_job_update PASSED
src/test/python/apache/aurora/client/api/test_api.py::TestJobUpdateApis::test_start_job_update_fails_parse_update_config PASSED
src/test/python/apache/aurora/client/api/test_restarter.py <- .pants.d/python-setup/chroots/409271aa2fa0ddb2b96e5056afbdff1b6bd7802b/.deps/mox-0.5.3-py2-none-any.whl/mox.py::TestRestarter::test_restart_all_instances PASSED
src/test/python/apache/aurora/client/api/test_restarter.py <- .pants.d/python-setup/chroots/409271aa2fa0ddb2b96e5056afbdff1b6bd7802b/.deps/mox-0.5.3-py2-none-any.whl/mox.py::TestRestarter::test_restart_instance_fails PASSED
src/test/python/apache/aurora/client/api/test_restarter.py <-

Re: [RESULT][VOTE] Release Apache Aurora 0.14.0 packages

2016-07-03 Thread Erb, Stephan
This vote has passed.  Vote summary:

+1 votes: 3 binding
+0 votes: 0
-1 votes: 0

Artifacts are now available in the official Apache Aurora bintray repos:
https://bintray.com/apache/aurora/ 

From: John Sirois <jsir...@apache.org>
Sent: Thursday, June 30, 2016 02:01
To: dev@aurora.apache.org
Subject: Re: [VOTE] Release Apache Aurora 0.14.0 packages

+1 as verified for debian-jessie, ubuntu-trusty and centos-7

On Wed, Jun 29, 2016 at 5:11 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> +1
>
> Verified for centos-7 and ubuntu-trusty.
>
> On Wed, Jun 29, 2016 at 2:20 PM, Erb, Stephan
> <stephan@blue-yonder.com> wrote:
> > All,
> >
> > I propose that we accept the following artifacts as the official deb and
> rpm packaging for Apache Aurora 0.14.0.
> >
> > https://dl.bintray.com/stephanerb/aurora/
> >
> > The Aurora deb and rpm packaging includes the following:
> > ---
> > The CHANGELOGs are available at:
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=blob_plain;f=specs/debian/changelog;hb=refs/heads/0.14.x
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=blob_plain;f=specs/rpm/aurora.spec;hb=refs/heads/0.14.x
> >
> > The branch used to create the packaging is:
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;h=refs/heads/0.14.x
> >
> > The packages are available at:
> > https://dl.bintray.com/stephanerb/aurora/
> >
> > The GPG keys used to sign the packages are available at:
> > https://dist.apache.org/repos/dist/release/aurora/KEYS
> >
> > Please download, verify, and test. Detailed test instructions are
> available here
> >
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;f=test;hb=refs/heads/0.14.x
> >
> > The vote will close on Sa 2. Jul 22:00:00 CEST 2016
> >
> > [ ] +1 Release these as the deb and rpm packages for Apache Aurora 0.14.0
> > [ ] +0
> > [ ] -1 Do not release these artifacts because...
> >
> > I would like to get the voting started off with my own +1
> >
> > Best Regards,
> > Stephan
>


[RESULT][VOTE] Release Apache Aurora 0.13.0 packages

2016-06-29 Thread Erb, Stephan
This vote has passed.  Vote summary:

+1 votes: 4 binding
+0 votes: 0
-1 votes: 0

Artifacts are now available in the official Apache Aurora bintray repos:
https://bintray.com/apache/aurora/ 


From: Joshua Cohen 
Sent: Wednesday, June 29, 2016 21:22
To: dev@aurora.apache.org; John Sirois
Subject: Re: [VOTE] Release Apache Aurora 0.13.0 packages

+1, looks good to me.

On Wed, Jun 29, 2016 at 1:16 PM, John Sirois  wrote:

> +1 the aurora-packaging test instructions worked for me for debian-jessie,
> ubuntu-trusty and centos-7
>
> On Wed, Jun 29, 2016 at 11:39 AM, Maxim Khutornenko 
> wrote:
>
> > +1
> >
> > Tested with pkg_root="https://dl.bintray.com/stephanerb/aurora/centos-7/
> "
> >
> > On Mon, Jun 20, 2016 at 3:56 PM, Stephan Erb  wrote:
> > > Unfortunately, I managed to break the links using newlines. Please be
> > > mindful when trying to follow them.
> > >
> > >
> > > On Di, 2016-06-21 at 00:53 +0200, Stephan Erb wrote:
> > >> All,
> > >>
> > >> I propose that we accept the following artifacts as the official deb
> > >> and rpm packaging for Apache Aurora 0.13.0.
> > >>
> > >> https://dl.bintray.com/stephanerb/aurora/
> > >>
> > >> The Aurora deb and rpm packaging includes the following:
> > >> ---
> > >> The CHANGELOGs are available at:
> > >> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=bl
> > >> ob_plain;f=specs/debian/changelog;hb=refs/heads/0.13.x
> > >> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=bl
> > >> ob_plain;f=specs/rpm/aurora.spec;hb=refs/heads/0.13.x
> > >>
> > >> The branch used to create the packaging is:
> > >> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tr
> > >> ee;h=refs/heads/0.13.x
> > >>
> > >> The packages are available at:
> > >> https://dl.bintray.com/stephanerb/aurora/
> > >>
> > >> The GPG keys used to sign the packages are available at:
> > >> https://dist.apache.org/repos/dist/release/aurora/KEYS
> > >>
> > >> Please download, verify, and test. Detailed test instructions are
> > >> available here
> > >> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tr
> > >> ee;f=test;hb=refs/heads/0.13.x.
> > >>
> > >> The vote will close on Fr 24. Jun 00:01:00 CEST 2016
> > >>
> > >> [ ] +1 Release these as the deb and rpm packages for Apache Aurora
> > >> 0.13.0
> > >> [ ] +0
> > >> [ ] -1 Do not release these artifacts because...
> > >>
> > >> I would like to get the voting started off with my own +1
> > >>
> > >> Best Regards,
> > >> Stephan
> >
>


Re: [VOTE] Release Apache Aurora 0.14.0 packages

2016-06-29 Thread Erb, Stephan
All,

I propose that we accept the following artifacts as the official deb and rpm 
packaging for Apache Aurora 0.14.0.

https://dl.bintray.com/stephanerb/aurora/

The Aurora deb and rpm packaging includes the following:
---
The CHANGELOGs are available at:
https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=blob_plain;f=specs/debian/changelog;hb=refs/heads/0.14.x
https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=blob_plain;f=specs/rpm/aurora.spec;hb=refs/heads/0.14.x

The branch used to create the packaging is:
https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;h=refs/heads/0.14.x

The packages are available at:
https://dl.bintray.com/stephanerb/aurora/

The GPG keys used to sign the packages are available at:
https://dist.apache.org/repos/dist/release/aurora/KEYS

Please download, verify, and test. Detailed test instructions are available here
https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;f=test;hb=refs/heads/0.14.x

The vote will close on Sa 2. Jul 22:00:00 CEST 2016

[ ] +1 Release these as the deb and rpm packages for Apache Aurora 0.14.0
[ ] +0
[ ] -1 Do not release these artifacts because...

I would like to get the voting started off with my own +1

Best Regards,
Stephan


Re: Few things we would like to support in aurora scheduler

2016-06-19 Thread Erb, Stephan

>> The next problem is related to the way we collect service cluster
>> status. I couldn't find a way to quickly get the latest statuses for all
>> instances/shards for a job in one query. Instead we query all task statuses
>> for a job, then manually iterate through all the statuses and filter the
>> latest one per instance id. For services with lots of churn on
>> task statuses that means huge blobs of thrift transferred every time we
> issue a query. I was thinking of adding something along this line:
>
>
>Does a TaskQuery filtering by job key and ACTIVE_STATES solve this?  Still
>includes the TaskConfig, but it's a single query, and probably rarely
>exceeds 1 MB in response payload.

We have a related problem, where we are interested in the status of the last 
executed cron job. Unfortunately, ACTIVE_STATES doesn’t help here. One potential 
solution I have thought about is a flag in TaskQuery that enables server-side 
sorting of tasks by their latest event time. We could then query the status of 
the latest run by using this flag in combination with limit=1. It could also 
be combined with the limit_per_instance flag to cover the usecase mentioned 
here.
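The client-side workaround described in this thread (query every task status, then keep only the newest one per instance) can be sketched as below; the tuple shape is a hypothetical stand-in for the thrift task records:

```python
def latest_status_per_instance(task_events):
    """Reduce (instance_id, event_timestamp, status) records to the most
    recent status per instance -- the manual filtering clients do today."""
    latest = {}
    for instance_id, timestamp, status in task_events:
        seen = latest.get(instance_id)
        # Keep only the record with the newest event timestamp per instance.
        if seen is None or timestamp > seen[0]:
            latest[instance_id] = (timestamp, status)
    return {i: status for i, (_, status) in latest.items()}
```

A server-side sort-by-latest-event-time flag plus limit=1 (or limit_per_instance) would let the scheduler return just this reduced view instead of shipping every status over thrift.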



On Thu, Jun 16, 2016 at 1:28 PM, Igor Morozov  wrote:

> Hi aurora people,
>
> I would like to start a discussion around few things we would like to see
> supported in aurora scheduler. It is based on our experience of integrating
> aurora into Uber infrastructure and I believe all the items I'm going to
> talk about will benefit the community and people running aurora clusters.
>
> 1. We support multiple aurora clusters in different failure domains and we
> run services in those domains. The upgrade workflow for those services
> includes rolling out the same version of a service software to all aurora
> clusters concurrently while monitoring the health status and other service
> vitals that includes like checking error logs, service stats,
> downstream/upstream services health. That means we occasionally need to
> manually trigger a rollback if things go south and rollback all the update
> jobs in all aurora clusters for that particular service. So here are the
> problems we discovered so far with this approach:
>
>- We don't have an easy way to assign a common unique identifier to
> all JobUpdates in different aurora clusters in order to reconcile them
> later into a single meta update job, so to speak. Instead we need to
> generate that ID and keep it in every aurora JobUpdate's
> metadata (JobUpdateRequest.taskConfig). Then, in order to get the status of the
> upgrade workflow running in different data centers, we have to query all
> recent jobs and, based on their metadata content, try to filter out the ones that
> we think belong to the currently running upgrade for the service.
>
> We propose to change
> struct JobUpdateRequest {
>   /** Desired TaskConfig to apply. */
>   1: TaskConfig taskConfig
>
>   /** Desired number of instances of the task config. */
>   2: i32 instanceCount
>
>   /** Update settings and limits. */
>   3: JobUpdateSettings settings
>
> *  /**Optional Job Update key's id, if not specified aurora will generate
> one**/*
>
> *  4: optional string id*}
>
> There is potentially another much more involved solution of supporting user
> defined metadata mentioned in this ticket:
> https://issues.apache.org/jira/browse/AURORA-1711
>
>
> -  All that brings us to a second problem we had to deal with during
> the upgrade:
> We don't have a good way to manually trigger a job update rollback in
> aurora. The use case is again the same, while running multiple update jobs
> in different aurora clusters we have a real production requirement to start
> rolling back update jobs if things are misbehaving and the nature of this
> misbehavior could be potentially very complex. Currently we abort the job
> update and start a new one that would essentially roll the cluster forward
> to a previously deployed version of the software.
>
> We propose a new convenience API to rollback a running or complete
> JobUpdate:
>
>   /** Rollback job update. */
>   Response rollbackJobUpdate(
>       /** The update to rollback. */
>       1: JobUpdateKey key,
>       /** A user-specified message to include with the induced job update
>           state change. */
>       3: string message)
>
> 2. The next problem is related to the way we collect service cluster
> status. I couldn't find a way to quickly get the latest statuses for all
> instances/shards of a job in one query. Instead we query all task statuses
> for a job, then manually iterate through all the statuses and filter out
> the latest one per instance id. For services with lots of churn on task
> statuses that means huge blobs of thrift transferred every time we
> issue a query. I was thinking of adding something along these lines:
> struct TaskQuery {
>   // TODO(maxim): Remove in 0.7.0. (AURORA-749)
>   8: Identity owner
>   14: string role
>   9: string environment
>   2: string 
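Until a server-side filter exists, the client-side workaround described above (iterating over all task statuses and keeping only the newest per instance id) can be sketched as below. The tuples are a simplified stand-in for the thrift task structs returned by the status query, not the actual API types.

```python
def latest_status_per_instance(tasks):
    """Keep only the most recent status event per instance id.

    `tasks` is an iterable of (instance_id, timestamp_ms, status) tuples,
    a simplification of the structs returned by the scheduler.
    """
    latest = {}
    for instance_id, ts, status in tasks:
        if instance_id not in latest or ts > latest[instance_id][0]:
            latest[instance_id] = (ts, status)
    return {i: status for i, (ts, status) in latest.items()}

events = [
    (0, 100, "PENDING"),
    (0, 200, "RUNNING"),
    (1, 150, "PENDING"),
    (1, 300, "FAILED"),
    (1, 350, "RUNNING"),
]
print(latest_status_per_instance(events))  # {0: 'RUNNING', 1: 'RUNNING'}
```

The proposed server-side query option would let the scheduler do this grouping, avoiding the transfer of full status histories.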

Re: aurora-packaging for 0.13

2016-06-14 Thread Erb, Stephan
I am way behind my plan for the 0.13 binaries. There is one pending patch I'd 
like to land before building and publishing the 0.13 binaries for a vote [1].

Once the 0.13 is out, I'd look into 0.14. 

If you are interested and eager, you could speed this up by already preparing a 
review request that would bring https://github.com/apache/aurora-packaging into 
a state ready for 0.14.0. I believe the only major necessary changes will be to 
bump the various Aurora and Mesos version strings scattered throughout the 
repository.

[1] https://reviews.apache.org/r/48606/

From: Mauricio Garavaglia <mauriciogaravag...@gmail.com>
Sent: Wednesday, June 15, 2016 00:01
To: dev@aurora.apache.org
Subject: Re: aurora-packaging for 0.13

Hi!
Given that we have aurora 0.14 now, can you add the proper branches/tags to
aurora-packaging to build it? Thanks!

Mauricio

On Mon, Jun 6, 2016 at 4:37 AM, Erb, Stephan <stephan@blue-yonder.com>
wrote:

> Hi Mauricio,
>
> the master of https://github.com/apache/aurora-packaging should be ready
> to use if you want to build 0.13 binaries.
>
> There are some minor cleanups needed before we can do an official release
> though. I will try to look into this until next week.
>
> Best Regards,
> Stephan
> 
> From: Mauricio Garavaglia <mauriciogaravag...@gmail.com>
> Sent: Monday, May 30, 2016 22:19
> To: dev@aurora.apache.org
> Subject: aurora-packaging for 0.13
>
> Hi guys,
>
> Looks like the aurora-packaging repo (
> https://github.com/apache/aurora-packaging) doesn't include a branch for
> the 0.13.x release, as it happened for previous releases, and the changes
> got merged right into master. Could you at least consider tagging the
> particular version intended to be released? Thanks
>
> Mauricio
>


Re: Aurora 0.14.0 release

2016-06-09 Thread Erb, Stephan
I will wait until we have addressed 
https://issues.apache.org/jira/browse/AURORA-1709 before tagging.


From: Joshua Cohen <jco...@apache.org>
Sent: Tuesday, June 7, 2016 16:21
To: dev@aurora.apache.org; Jake Farrell
Subject: Re: Aurora 0.14.0 release

+1 to releasing 0.14.0 and thanks for volunteering to act as RM, Stephan!

On Tue, Jun 7, 2016 at 8:52 AM, Jake Farrell <jfarr...@apache.org> wrote:

> Take a look at the "Creating a release" section in docs/development/
> committers-guide.md, you will also need to add your GPG key to the
> necessary
> files, feel free to ping me with any questions or bring them up in #aurora
> as a number of us have now been RM's before
>
> -Jake
>
> On Mon, Jun 6, 2016 at 6:54 PM, Erb, Stephan <stephan@blue-yonder.com>
> wrote:
>
> > +1 for a RC this week.
> >
> > I'd volunteer to serve as a release manager, but will probably need
> > someone to walk me through the necessary steps.
> > 
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Tuesday, June 7, 2016 00:15
> > To: dev@aurora.apache.org
> > Subject: Aurora 0.14.0 release
> >
> > With the resource refactoring and GPU support landed plus appc
> > container ongoing work presumably not requiring any further schema
> > changes (Joshua, please correct if that's not the case), how does
> > everyone feel about cutting a release candidate this week?
> >
> > Assuming no objections: anyone wants to be a release manager for 0.14.0?
> >
>

Re: Aurora 0.14.0 release

2016-06-06 Thread Erb, Stephan
+1 for a RC this week.

I'd volunteer to serve as a release manager, but will probably need someone to 
walk me through the necessary steps.

From: Maxim Khutornenko 
Sent: Tuesday, June 7, 2016 00:15
To: dev@aurora.apache.org
Subject: Aurora 0.14.0 release

With the resource refactoring and GPU support landed plus appc
container ongoing work presumably not requiring any further schema
changes (Joshua, please correct if that's not the case), how does
everyone feel about cutting a release candidate this week?

Assuming no objections: anyone wants to be a release manager for 0.14.0?


Re: Log aggregation to Kafka

2016-05-02 Thread Erb, Stephan
As of today, this is not possible with the thermos logger. However, I can think 
of two potential solutions:

*  you start an additional (ephemeral) process together with your job that 
forwards the content of the stdout & stderr files to Kafka 
(https://github.com/apache/aurora/blob/17ade117b8171d4ef4be62fd4b617a024658551d/docs/reference/configuration.md#ephemeral)
* you adjust the code you start on Aurora to automatically ship your logs to 
Kafka or any other log aggregator you want to use. This tends to have the 
advantage that you can do structured logging. 

Kind Regards,
Stephan

From: thinker0 
Sent: Friday, April 29, 2016 16:03
To: dev@aurora.apache.org
Subject: Log aggregation to Kafka

hi.

I wish the Thermos logger had the ability to forward logs to Kafka.

Re: Aurora and Mesos

2016-05-02 Thread Erb, Stephan
Hi Supun,

Aurora is a Mesos framework and therefore does not work without Mesos. Details 
how Mesos & Aurora work together can be found here 
https://github.com/apache/aurora/blob/master/docs/getting-started/overview.md

It is possible to install Mesos masters and Aurora schedulers on different 
nodes. However, code started via Aurora will always need to run on nodes 
running the Mesos slave/agent binary.

Best Regards,
Stephan

From: Supun Kamburugamuve 
Sent: Saturday, April 30, 2016 06:13
To: dev@aurora.apache.org
Subject: Aurora and Mesos

Hi Devs,

I'm new to Aurora. I have a research project where I want to run Aurora
on a different cluster than Mesos. Is the coupling between Aurora and Mesos
too tight for the two to be separated, or do you have provisions to
support other systems besides Mesos?

Thanks,
Supun..

Re: [PROPOSAL] Switch aurora client from service discovery to HTTP redirects.

2016-04-20 Thread Erb, Stephan
So, scheduler_uri would point to a domain name with multiple A-records? 

We could probably also extend the interface to support a list of scheduler 
uris. That would make an initial HA-setup simpler, as it would not require the 
DNS entries or a load balancer. People could just use a list of IPs.
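The list-of-IPs idea could look roughly like this on the client side: try each configured scheduler URL in turn. `probe` is injected here for testability; in practice it would be an HTTP request that follows the leader redirect. The function is illustrative and not part of clusters.py.

```python
def pick_scheduler(urls, probe):
    """Return the first scheduler URL for which probe(url) succeeds."""
    for url in urls:
        try:
            if probe(url):
                return url
        except Exception:
            continue  # unreachable scheduler, try the next one
    raise RuntimeError("no scheduler reachable: %s" % (urls,))

urls = ["http://10.0.0.1:8081", "http://10.0.0.2:8081"]
print(pick_scheduler(urls, lambda u: u.endswith("2:8081")))
```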

From: John Sirois 
Sent: Tuesday, April 19, 2016 19:37
To: Joshua Cohen
Cc: dev@aurora.apache.org
Subject: Re: [PROPOSAL] Switch aurora client from service discovery to HTTP 
redirects.

On Mon, Apr 18, 2016 at 4:37 PM, Joshua Cohen  wrote:

> And apparently this is not part of our fork at all, the client already
> supports this today! The only potential change that would be required would
> be ensuring the client properly follows redirects.
>

Aha - yes.  Thank you!

Redirect follows are indeed not explicitly enabled on the requests calls
today, but they are enabled implicitly for all non-HEAD requests IIUC.

So I think there is a small amount of work to test that redirection
actually works and to deprecate the service-discovery codepath before
removing support.
Since there is support for the idea in the codebase already and you've
indicated no objection I'll proceed to draw up some RBs.


> On Mon, Apr 18, 2016 at 5:35 PM, Joshua Cohen  wrote:
>
>> Er, it's not `proxy_url`, it's  `scheduler_uri` (which makes much more
>> sense ;)).
>>
>> On Mon, Apr 18, 2016 at 5:34 PM, Joshua Cohen  wrote:
>>
>>> I'm not opposed to this, in fact we already do something similar
>>> internally. We've forked clusters.py to allow configuring a `proxy_url` for
>>> each cluster. If that's present, then the client will use it rather than
>>> performing service discovery to communicate with the scheduler.
>>>
>>> On Mon, Apr 18, 2016 at 2:03 PM, John Sirois  wrote:
>>>
 As part of work on switching the Aurora scheduler from using a copy of
 the
 Twitter zookeeper libs to the Apache Curator libs [1], I'd like to
 enable a
 Curator feature (GUID protection [2]) that will make the scheduler more
 robust to ZooKeeper outages but has the side-effect of breaking the
 effectively public API formed by the ZooKeeper path names it uses to
 perform leader election and service advertisement.  From Aurora's
 standpoint, the API consumer is the Aurora command-line client.  It uses
 service-discovery today to find the leading scheduler to communicate
 with.
 I propose switching the client to use HTTP re-directs instead since the
 scheduler already has a redirect service and since this would reduce the
 scheduler API surface area and generally move further down the road of a
 pure HTTP api.

 The most obvious problem I see here is just the mechanics of a proper
 deprecation cycle for all those clients Aurora is not directly aware of
 that rely on its service advertisment API in ZooKeeper today.

 Are there other problems folks see with this?

 [1] https://issues.apache.org/jira/browse/AURORA-1468
 [2] https://issues.apache.org/jira/browse/AURORA-1670

>>>
>>>
>>
>


Re: Populate DiscoveryInfo in Mesos

2016-04-05 Thread Erb, Stephan
There has been some promising progress on the review request: 
https://reviews.apache.org/r/45177/ 

Does anyone else have comments, or has anyone identified a blocking issue? Otherwise, this 
beta-feature is close to merging, probably even before the RC planned for 
tomorrow.

From: Zhitao Li <zhitaoli...@gmail.com>
Sent: Friday, April 1, 2016 03:10
To: dev@aurora.apache.org
Subject: Re: Populate DiscoveryInfo in Mesos

Benjamin,

You are exactly right. The problem is on Mesos DNS side because it has its
own rules of shortening names and replacing dots to other characters.

IMO, relying on generating one "name" which would be useful for all
systems may be idealistic. I like the "label" concept in recent
Mesos/Docker systems; Mesos DNS should probably take an optional label
to allow its users to customize the behavior, and Aurora could easily
adopt that, e.g. by duplicating labels from TaskInfo to DiscoveryInfo.

Right now, the only open-sourced project using DiscoveryInfo is Mesos DNS,
so there is no real convention in the community yet.


On Thu, Mar 31, 2016 at 5:39 PM, ben...@gmail.com <ben...@gmail.com> wrote:

> FYI, Aurora already populates the "executor source" field (not sure exactly
> what that corresponds to in mesos.proto) with exactly the data you would
> want to send to mesos-dns: rolename.environment.jobname.[tasknumber] for
> each task.  Maybe you would need to invert the order of the fields, but
> that's pretty much the right thing.
>
> On Thu, Mar 31, 2016 at 12:53 PM Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Hi Stephan,
> >
> > I like your proposal, but I think they all require some changes on Mesos
> > DNS to support this level of customization. I've filed a github issue to
> > mesos-dns <https://github.com/mesosphere/mesos-dns/issues/414> to
> describe
> > what I want.
> >
> > I've updated my patch to include unit test and command flag switch, and
> > it's ready for review now.
> >
> > On Thu, Mar 31, 2016 at 2:32 AM, Erb, Stephan <
> stephan@blue-yonder.com
> > >
> > wrote:
> >
> > > If I understand your example correctly, the underlying job key used to
> > > generate "vagranttesthttp-exampled.twitterscheduler.mesos" was
> > > "vagrant/test/http-exampled" and what we actually put into the
> > > DiscoveryInfo is "vagrant.test.http-exampled".
> > >
> > > So how about:
> > > * we inject inverse names, so for example:
> > > "http-exampled.test.vagrant"
> > > * we teach mesos-DNS that it should not silently drop dots in our names
> > >
> > > That should provide us with hierarchical, collision free DNS names such
> > as
> > > "http-exampled.test.vagrant.twitterscheduler.mesos".
> > >
> > > Bonus points if we get "twitterscheduler" replaced by the actual
> cluster
> > > name.
> > >
> > > 
> > > From: Zhitao Li <zhitaoli...@gmail.com>
> > > Sent: Thursday, March 31, 2016 01:08
> > > To: dev@aurora.apache.org
> > > Subject: Re: Populate DiscoveryInfo in Mesos
> > >
> > > On Wed, Mar 30, 2016 at 3:58 PM, Joshua Cohen <jco...@apache.org>
> wrote:
> > >
> > > > Job names are not unique though, what would happen if multiple jobs
> had
> > > the
> > > > same name (either across roles or across environments in the same
> > role)?
> > > >
> > >
> > > Good point. They would conflict with each other, and I guess in that
> case
> > > Mesos DNS should not be used with the cluster.
> > >
> > > An alternative is {role}-{job name}, although there are still ways to
> > > create conflict in such a case (e.g. "role-dummy/test/job" and
> > > "role/test/dummy-job" generate the same name).
> > >
> > > I think the correct long term approach is to allow some way to
> configure
> > > this information by task or job. I'm a bit hesitant to include new
> thrift
> > > structures for this experiment, and maybe the idea of
> "TaskInfoDecorator"
> > > (see my previous posts) would be more flexible?
> > >
> > >
> > > >
> > > > On Wed, Mar 30, 2016 at 5:33 PM, Zhitao Li <zhitaoli...@gmail.com>
> > > wrote:
> > > >
> > > > > Stephan,
> > > > >
> > > > > So I've managed to run the official Mesos DNS docker container
> > > > > <https://hu

Re: Are we ready to remove the observer?

2016-04-01 Thread Erb, Stephan
From an operator and Aurora developer perspective, it would be really great to 
get rid of the thermos observer quickly.

However, from a user perspective the usability gap between observer and plain 
Mesos sandbox browsing is quite large right now. I agree with Benjamin here 
that it would probably work if we generate html pages ready for user 
consumption. 

These are the relevant tickets in our tracker: 
* https://issues.apache.org/jira/browse/AURORA-725
* https://issues.apache.org/jira/browse/AURORA-777 


From: ben...@gmail.com 
Sent: Friday, April 1, 2016 02:35
To: dev@aurora.apache.org
Subject: Re: Are we ready to remove the observer?

Is there any chance we can keep the per-process cpu and ram utilization
stats?  That's one of the coolest things about aurora, imo.  The executor
is already writing those checkpoints inside the mesos sandbox (I think?),
so perhaps it could also produce the html pages that the observer currently
renders?

On Thu, Mar 31, 2016 at 4:33 PM Zhitao Li  wrote:

> +1.
>
> On Thu, Mar 31, 2016 at 4:11 PM, Bill Farner  wrote:
>
> > Assuming that the vast majority of utility provided by the observer is
> > sandbox/log browsing - can we remove it and link to sandbox browsing that
> > mesos provides?
> >
> > The rest of the information could be (or already is) logged in the
> sandbox
> > for the rare debugging scenarios that call for it.
> >
>
>
>
> --
> Cheers,
>
> Zhitao Li
>

Re: Populate DiscoveryInfo in Mesos

2016-03-31 Thread Erb, Stephan
If I understand your example correctly, the underlying job key used to generate 
"vagranttesthttp-exampled.twitterscheduler.mesos" was 
"vagrant/test/http-exampled" and what we actually put into the DiscoveryInfo is 
"vagrant.test.http-exampled".

So how about:
* we inject inverse names, so for example: 
"http-exampled.test.vagrant"
* we teach mesos-DNS that it should not silently drop dots in our names

That should provide us with hierarchical, collision free DNS names such as 
"http-exampled.test.vagrant.twitterscheduler.mesos".

Bonus points if we get "twitterscheduler" replaced by the actual cluster name.
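The inverse-name scheme can be sketched as a small helper. The defaults mirror the example above; the parameter names are illustrative and not an existing Aurora or Mesos-DNS API.

```python
def inverse_dns_name(job_key, framework="twitterscheduler", domain="mesos"):
    """Build a hierarchical DNS name from an Aurora job key "role/env/name"."""
    role, env, name = job_key.split("/")
    # Invert role/env/name and keep the dots so the resulting record is
    # hierarchical and collision-free.
    return ".".join([name, env, role, framework, domain])

print(inverse_dns_name("vagrant/test/http-exampled"))
# http-exampled.test.vagrant.twitterscheduler.mesos
```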


From: Zhitao Li <zhitaoli...@gmail.com>
Sent: Thursday, March 31, 2016 01:08
To: dev@aurora.apache.org
Subject: Re: Populate DiscoveryInfo in Mesos

On Wed, Mar 30, 2016 at 3:58 PM, Joshua Cohen <jco...@apache.org> wrote:

> Job names are not unique though, what would happen if multiple jobs had the
> same name (either across roles or across environments in the same role)?
>

Good point. They would conflict with each other, and I guess in that case
Mesos DNS should not be used with the cluster.

An alternative is {role}-{job name}, although there are still ways to
create conflict in such a case (e.g. "role-dummy/test/job" and
"role/test/dummy-job" generate the same name).

I think the correct long term approach is to allow some way to configure
this information by task or job. I'm a bit hesitant to include new thrift
structures for this experiment, and maybe the idea of "TaskInfoDecorator"
(see my previous posts) would be more flexible?


>
> On Wed, Mar 30, 2016 at 5:33 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Stephan,
> >
> > So I've managed to run the official Mesos DNS docker container
> > <https://hub.docker.com/r/mesosphere/mesos-dns/> under the Aurora
> vagrant
> > environment and get some SRV/A recorded pulled from Mesos master from
> > Aurora.
> >
> > Because Mesos DNS uses 'name' field if set with some string manipulation,
> > for the job 'vagrant/test/http_example_docker', my prototype generates
> > these DNS records:
> >
> > A record: vagranttesthttp-exampled.twitterscheduler.mesos
> > SRV record: _vagranttesthttp-exampled._tcp.twitterscheduler.mesos.
> >
> > If we want to make current prototype useful for Mesos DNS, I suggest we
> > change the name field to job name, which would generate record like:
> > A: http_example_docker.twitterscheduler.mesos
> > SRV: _http_example_docker._tcp.twitterscheduler.slave.mesos
> >
> > I'll update my patch after getting some signal from you. Thanks.
> >
> > On Fri, Mar 25, 2016 at 1:49 PM, Zhitao Li <zhitaoli...@gmail.com>
> wrote:
> >
> > > Hi Stephan,
> > >
> > > Thanks for looking at that prototype patch.
> > >
> > > I'll update the patch with the review comments, and probably add a
> global
> > > flag of "populate_discovery_info" to toggle this behavior.
> > >
> > > About the optional fields: I think it'll be hard to come up a good set
> of
> > > rules applicable to all orgs using Aurora + Mesos, because cluster
> > > management and service discovery stack could differ from org to org.
> > >
> > > In a recent Mesos work group, some experienced folks (Jie Yu and Ben
> > > Mahler) mentioned the idea of a *TaskInfoDecorator*, an optional and
> > > configurable plugin on the Aurora scheduler side that would allow
> > > operators to set additional fields before sending the message to Mesos.
> > > I like such an idea because it would enable Aurora users to experiment
> > > faster. Do you
> > > think this is an interesting idea worth pursuing?
> > >
> > >
> > > On Fri, Mar 25, 2016 at 1:42 PM, Erb, Stephan <
> > stephan@blue-yonder.com
> > > > wrote:
> > >
> > >> I had a closer look at the Mesos documentation, and a design document
> > >> might be unnecessary. Most of the values are optional. We can
> therefore
> > >> leave them out until we have a proper usecase for them.
> > >>
> > >> I left a couple of comments in the review request.
> > >> 
> > >> From: Zhitao Li <zhitaoli...@gmail.com>
> > >> Sent: Tuesday, March 22, 2016 21:15
> > >> To: dev@aurora.apache.org
> > >> Subject: Re: Populate DiscoveryInfo in Mesos
> > >>
> > >> Hi Stephan,
> > >>
> > >> Sorry for the delay on follow up 

Re: Looking for feedback - Setting CommandInfo.user by default when launching tasks.

2016-03-29 Thread Erb, Stephan
I am in favor of your proposal. We offer less attack surface if the executor is 
not running as root.

Interesting though, this introduces another security problem: The credentials 
file in the incoming Zookeeper  ACL patch (https://reviews.apache.org/r/45042/) 
will have to be readable by everyone. That feels a little bit like being back 
to square one.

From: Steve Niemitz 
Sent: Tuesday, March 29, 2016 17:34
To: dev@aurora.apache.org
Subject: Looking for feedback - Setting CommandInfo.user by default when 
launching tasks.

I've been working on some changes to how aurora submits tasks to mesos,
specifically around Docker tasks, but I'd also like to see how people feel
about making it more general.

Currently, when Aurora submits a task to mesos, it does NOT set
command.user on the ExecutorInfo, this means that mesos configures the
sandbox (mesos sandbox that is) as root, and launches the executor
(thermos_executor in our case) as root as well.

What then happens is that the executor then chown()s the sandbox it creates
to the aurora role/user, and also setuid()s the runners it forks to that
role/user.  However, the executor itself is still running as root.

My proposal / change is to set command.user to the aurora role by default,
which will cause the executor to run as that user.  I've tested this
already, and no changes are needed to the executor, it will still try to
chown the sandbox (which is fine since it already owns it), and setuid()
the runners it forks (again, fine, since they're already running as that
user).

*The controversial part of this* however is I'd like to enable this
behavior BY DEFAULT, and allow disabling it (reverting to the current
behavior now) via a flag to the scheduler.  My reasoning here is two fold.
 1) It's a more secure default, preventing a compromised executor from
doing things it shouldn't, and 2) we already have a lot of "flag bloat",
and flags are hard enough to discover as they are.  However, I do believe
this should be considered as a "breaking change", particularly because of
finicky PEX extraction for the executor.

I'd like to hear people's thoughts on this.

Re: Populate DiscoveryInfo in Mesos

2016-03-25 Thread Erb, Stephan
I had a closer look at the Mesos documentation, and a design document might be 
unnecessary. Most of the values are optional. We can therefore leave them out 
until we have a proper usecase for them.

I left a couple of comments in the review request. 

From: Zhitao Li <zhitaoli...@gmail.com>
Sent: Tuesday, March 22, 2016 21:15
To: dev@aurora.apache.org
Subject: Re: Populate DiscoveryInfo in Mesos

Hi Stephan,

Sorry for the delay on follow up on this. I took a quick look at Aurora
code, and it's actually quite easy to pipe this information to Mesos (see
https://reviews.apache.org/r/45177/ for quick prototype).

I'll take a stab to see how I can get Mesos-DNS to work with this prototype.

IMO, if this is something the community is interested, the main questions
would be 1) how various fields would be mapped in different Aurora usages,
and 2) to which level should opt-in/opt-out configured for populating such
information.

I actually don't have much insight into how these usage conventions
would be set (through the scheduler command line or job configuration?).

Do you think a design doc is the best action here, or a more involved
questionnaire about which fields would be useful for community, or what
value they should take?

On Mon, Mar 7, 2016 at 1:00 AM, Erb, Stephan <stephan@blue-yonder.com>
wrote:

> That sounds like a good idea! Great.
>
> If you go ahead with this, please be so kind and start by posting a short
> design document here on mailinglist (similar to those here
> https://github.com/apache/aurora/blob/master/docs/design-documents.md,
> but probably shorter).
>
> This will allow us to split the discussion of the design from discussing
> the actual implementation. I believe this is necessary, as the
> DiscoveryInfo protocol is quite flexible (
> http://mesos.apache.org/documentation/latest/app-framework-development-guide/
> ).
>
> Thanks,
> Stephan
>
>
> 
> From: Zhitao Li <zhitaoli...@gmail.com>
> Sent: Monday, March 7, 2016 00:05
> To: dev@aurora.apache.org
> Subject: Populate DiscoveryInfo in Mesos
>
> Hi,
>
> It seems like Aurora does not populate the "discovery" field in either
> TaskInfo or ExecutorInfo in mesos.proto
> (https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L438).
>
> I'm considering adding this to support retrieving port map in Mesos
> directly. This would enable us to discovery this information directly from
> Mesos side, and also enables us to build one universal service discovery
> solution for multiple frameworks including Aurora.
>
> If no objection, I'll create a JIRA ticket for this task.
>
> Thanks.
> --
> Cheers,
>
> Zhitao Li
>



--
Cheers,

Zhitao Li


Re: [VOTE] Release Apache Aurora 0.12.0 rpms

2016-03-21 Thread Erb, Stephan
+1 

Verified using the install instructions here 
https://github.com/apache/aurora-packaging/blob/master/test/rpm/centos-7/README.md#released
but together with John's pkg_root.

I have noticed two things:
* Our thermos_root variable does not seem to play well with the default 
work_dir of the slave; they use different defaults. However, this is not really 
a problem with the packaging but rather with the default configuration. Our 
install guide even covers the case.
* Our install guide (in the Aurora repo) is heavily outdated and a little bit 
confusing. Maybe we can merge it with the Vagrant test file so that only a 
single authority tells a user how to do the setup.

Packages in general are fine, so we can leave those two issues for another day.

From: John Sirois 
Sent: Monday, March 21, 2016 14:25
To: Bill Farner
Cc: dev@aurora.apache.org
Subject: Re: [VOTE] Release Apache Aurora 0.12.0 rpms

Folks - the rpm release for 0.12.0 still needs one more binding +1 to pass.

From what I can tell, there is no need to create a new RC (there are no
identified bugs) and the only mention of time limits in the voting
guidelines [1] is to bound the voting window on the low side to give enough
time for folks to vote.  As such, unless someone is aware of conflicting
policy, I'll extend this vote until we get a binding +1 or any -1.

Please test if you can find the time!

[1] http://www.apache.org/foundation/voting.html

On Tue, Mar 15, 2016 at 10:19 AM, John Sirois  wrote:

> I'll be offline tomorrow through Saturday (3/16-3/19) and the vote closes
> Sunday 3/20 10am Mountain time so sending out a last reminder to test these
> rpms and vote if you have time between now and then.
>
>
> On Mon, Mar 14, 2016 at 2:22 PM, Bill Farner  wrote:
>
>> +1
>>
>> Verified using instructions starting here:
>> https://github.com/apache/aurora-packaging/blob/master/test/rpm/centos-7/README.md#released
>>
>> and
>> pkg_root="https://dl.bintray.com/john-sirois/aurora/centos-7/"
>>
>>
>> On Mon, Mar 14, 2016 at 12:10 PM, John Sirois  wrote:
>>
>>> I propose that we accept the following artifacts as the official rpm
>>> packaging
>>> for Apache Aurora 0.12.0.
>>>
>>> https://dl.bintray.com/john-sirois/aurora/centos-7/
>>>
>>> The Aurora rpm packaging includes the following:
>>> ---
>>> The CHANGELOG is viewable at:
>>> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=log;h=refs/heads/0.12.x;hp=refs/heads/0.11.x
>>>
>>> The branch used to create the packaging is:
>>> https://git1-us-west.apache.org/repos/asf?p=aurora
>>> -packaging.git;a=tree;h=refs/heads/0.12.x
>>>
>>> The packages are available at:
>>> https://dl.bintray.com/john-sirois/aurora/centos-7/
>>>
>>> The GPG keys used to sign the packages are available at:
>>> https://dist.apache.org/repos/dist/release/aurora/KEYS
>>>
>>> Please download, verify, and test.
>>>
>>> The vote will close on Sun, 20 Mar 2016 10:00:00 -0700
>>>
>>> [ ] +1 Release these as the rpm packages for Apache Aurora 0.12.0
>>> [ ] +0
>>> [ ] -1 Do not release these artifacts because...
>>> ---
>>>
>>> Please consider verifying these rpms using the install guide:
>>> https://github.com/apache/aurora/blob/master/docs/installing.md
>>>
>>> Or by using the test guide for rpms:
>>>
>>> https://github.com/apache/aurora-packaging/blob/master/test/rpm/centos-7/README.md
>>>
>>>
>>>
>>> I'd like to kick off voting with my own +1
>>>
>>
>>
>

Re: [PROPOSAL] Supporting Mesos Universal Containers

2016-03-15 Thread Erb, Stephan
> Does anyone think we need a stop-the-world moment to come up with a long
> term, holistic plan, or is it reasonable to assess the situation as we go?

FWIW, I am ok with moving along and assessing the situation on the fly.

We cannot tell right now when the unified containerizer will be rock-solid, so a 
couple of improvement patches for the native Docker support will probably do 
more good than harm. We just have to keep an eye on what's happening in Mesos 
itself.

Regards,
Stephan


From: Joshua Cohen <jco...@apache.org>
Sent: Tuesday, March 15, 2016 15:55
To: dev@aurora.apache.org
Subject: Re: [PROPOSAL] Supporting Mesos Universal Containers

I've gone ahead and filed some tickets breaking down the work involved in
this effort. They're all contained within this epic:
https://issues.apache.org/jira/browse/AURORA-1634.

I agree that we should assess our plans re: containers as a whole. My
understanding of the current state of the world is as follows:


   1. There are definite benefits to adopting the unified containerizer to
   launch image-based tasks (outlined in the design doc, but I'll reiterate
   here: better interaction w/ Mesos isolators, no need to rely on external
   daemons to coordinate containers, etc.).
   2. There are also considerations to keep in mind, specifically, we now
   rely on Mesos maintaining this image support, which is secondary to its
   primary goals, rather than relying on organizations that are invested in
   the formats. Additionally, we'll have to wait on Mesos to implement new
   features of image formats before they can be adopted by Aurora users.

I don't believe that number 2 above should be a blocker to number 1, but
it's a caveat that we must always keep in mind. I'll point out that the
design doc takes a very cautious approach towards deprecating the current
Docker support. It may be that we maintain both in perpetuity. It's also
possible, and I'm hopeful that this is the case, that the Mesos community
will show responsible custodianship of the unified container support,
allaying the concerns outlined above and allowing us to deprecate the
current native-Docker containerizer support. In either case described here,
I think Aurora users will benefit.

Does anyone think we need a stop-the-world moment to come up with a long
term, holistic plan, or is it reasonable to assess the situation as we go?


On Sun, Mar 13, 2016 at 7:23 AM, Erb, Stephan <stephan@blue-yonder.com>
wrote:

> As mentioned in IRC, I like the proposal.
>
> Still, we need a discussion regarding the future of current Docker
> support. Especially since Bill and John have now started improving it. What
> are our plans here? What  are the plans of the Mesos community (i.e.,
> deprecation of the docker containerizer)?
>
> In addition, the switch implemented for disabling Thermos when running
> Docker kind of reminded me of
> https://issues.apache.org/jira/browse/AURORA-1288. It is probably
> worthwhile to at least assess this as a whole.
>
> Kind Regards,
> Stephan
>
> 
> From: Joshua Cohen <jco...@apache.org>
> Sent: Monday, March 7, 2016 20:58
> To: dev@aurora.apache.org
> Subject: [PROPOSAL] Supporting Mesos Universal Containers
>
> Hi all,
>
> I'd like to propose we adopt the Mesos universal container support for
> provisioning tasks from both Docker and AppC images. Please review the doc
> below and let me know what you think.
>
>
> https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/edit?usp=sharing
>
> Thanks!
>
> Joshua
>

Re: [PROPOSAL] Supporting Mesos Universal Containers

2016-03-13 Thread Erb, Stephan
As mentioned in IRC, I like the proposal.

Still, we need a discussion regarding the future of current Docker support. 
Especially since Bill and John have now started improving it. What are our 
plans here? What are the plans of the Mesos community (i.e., deprecation of 
the docker containerizer)?

In addition, the switch implemented for disabling Thermos when running Docker 
kind of reminded me of https://issues.apache.org/jira/browse/AURORA-1288. It is 
probably worthwhile to at least assess this as a whole.

Kind Regards,
Stephan


From: Joshua Cohen 
Sent: Monday, March 7, 2016 20:58
To: dev@aurora.apache.org
Subject: [PROPOSAL] Supporting Mesos Universal Containers

Hi all,

I'd like to propose we adopt the Mesos universal container support for
provisioning tasks from both Docker and AppC images. Please review the doc
below and let me know what you think.

https://docs.google.com/document/d/111T09NBF2zjjl7HE95xglsDpRdKoZqhCRM5hHmOfTLA/edit?usp=sharing

Thanks!

Joshua

Re: [VOTE] Release Apache Aurora 0.12.0 rpms

2016-03-12 Thread Erb, Stephan
I'll try to spin up a centos box myself and see how that goes.

> Dependency naming aside, i think we should omit docker from our
> dependencies, as it really should be a mesos dep if anything.  *I can send
> a patch for that if others agree.*

We also have to be more diligent in tracking the Mesos version we depend on. It 
already got out of date here [1] and here [2].

[1] 
https://github.com/apache/aurora/blob/master/docs/installing.md#mesos-on-centos-7
[2] 
https://github.com/apache/aurora-packaging/blob/master/specs/rpm/aurora.spec#L41



From: Bill Farner 
Sent: Saturday, March 12, 2016 21:34
To: dev@aurora.apache.org; jsir...@apache.org
Subject: Re: [VOTE] Release Apache Aurora 0.12.0 rpms

-1

I had trouble getting these to work.  I used the vagrant environment
here:
https://github.com/apache/aurora-packaging/tree/master/test/rpm/centos-7

*Executor:*
$ sudo rpm -ivh aurora-executor-0.12.0-1.el7.centos.aurora.x86_64.rpm
error: Failed dependencies:
docker is needed by aurora-executor-0.12.0-1.el7.centos.aurora.x86_64

Apparently the official docker package is called docker-engine
https://docs.docker.com/engine/installation/linux/centos/

Dependency naming aside, i think we should omit docker from our
dependencies, as it really should be a mesos dep if anything.  *I can send
a patch for that if others agree.*

*Scheduler:*
I had trouble getting the scheduler to start: it exits due to an uncaught
exception in the main thread, and unfortunately a stack trace doesn't turn
up in journalctl.  We need to figure out why the errors don't show up,
possibly in conjunction with addressing the items below.

Doing some investigation, i noticed something strange - JAVA_OPTS (set in
/etc/sysconfig/aurora) doesn't make it to the process launched by systemd.
It seems to be discarded when /usr/bin/aurora-scheduler-startup calls
/usr/lib/aurora/bin/aurora-scheduler.  Other variables (e.g.
AURORA_SCHEDULER_OPTS) propagate fine.  I've probably been staring at this
too long and am missing something obvious, but i'm not making sense of it.

Sidestepping the above issue, i discovered 2 reasons the scheduler won't
start up:
- Default backup dir /var/lib/aurora/scheduler/backups does not exist,
insufficient permission to create

- Fails to load the mesos native lib
aurora-scheduler-startup[8500]: Failed to load native Mesos library from
/usr/lib;/usr/lib64
I was able to fix this by removing ;/usr/lib64 from
-Djava.library.path='/usr/lib;/usr/lib64', alternatively by removing the
library.path setting and exporting LD_LIBRARY_PATH=/usr/lib.
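
For anyone puzzling over that symptom: on Linux, java.library.path is split on ':' (the platform path separator), so a ';' leaves the whole value as one nonexistent directory. A tiny sketch of the parsing difference (plain Python standing in for the JVM's behavior):

```python
def split_library_path(value, sep=":"):
    # The JVM splits java.library.path on the platform separator (':' on Linux).
    # A ';' is treated as part of the directory name, not as a separator.
    return [p for p in value.split(sep) if p]

print(split_library_path("/usr/lib;/usr/lib64"))  # one bogus entry: ['/usr/lib;/usr/lib64']
print(split_library_path("/usr/lib:/usr/lib64"))  # ['/usr/lib', '/usr/lib64']
```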

Happy to pitch in on fixing these issues, curious what folks think of the
items above, especially the JAVA_OPTS issue.



On Fri, Mar 11, 2016 at 1:31 PM, John Sirois  wrote:

> Pinging this VOTE and noting that the close is Monday at 11am Mountain
> time.
>
> Please test!
>
> On Wed, Mar 9, 2016 at 11:03 AM, John Sirois  wrote:
>
> >
> >
> > On Wed, Mar 9, 2016 at 11:00 AM, John Sirois  wrote:
> >
> >> I propose that we accept the following artifacts as the official rpm
> packaging
> >> for Apache Aurora 0.12.0.
> >>
> >> *https://dl.bintray.com/john-sirois/aurora/centos-7/
> >> *
> >>
> >> The Aurora rpm packaging includes the following:
> >> ---
> >> The CHANGELOG is viewable at:
> >> *
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=log;h=refs/heads/0.12.x;hp=refs/heads/0.11.x
> >> <
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=log;h=refs/heads/0.12.x;hp=refs/heads/0.11.x
> >*
> >>
> >> The branch used to create the packaging is:
> >>
> >>
> https://git1-us-west.apache.org/repos/asf?p=aurora-packaging.git;a=tree;h=refs/heads/0.12.x
> >>
> >> The packages are available at:
> >> *https://dl.bintray.com/john-sirois/aurora/centos-7/
> >> *
> >>
> >> The GPG keys used to sign the packages are available at:
> >> https://dist.apache.org/repos/dist/release/aurora/KEYS
> >>
> >> Please download, verify, and test.
> >>
> >> The vote will close on Mon, 14 Mar 2016 11:00:00 -0700
> >>
> >> [ ] +1 Release these as the deb packages for Apache Aurora 0.12.0
> >>
> >
> > Correction - "Release these as the rpm packages for Apache Aurora 0.12.0"
> >
> > [ ] +0
> >> [ ] -1 Do not release these artifacts because...
> >> ---
> >>
> >>
> > And again, copypasta - "Please consider verifying these rpms using the
> > install guide:"
> >
> > Please consider verifying these debs using the install guide:
> >>   https://github.com/apache/aurora/blob/master/docs/installing.md
> >>
> >>
> >> I'd like to kick off voting with my own +1
> >>
> >
> >
>

Re: Populate DiscoveryInfo in Mesos

2016-03-07 Thread Erb, Stephan
That sounds like a good idea! Great.

If you go ahead with this, please be so kind as to start by posting a short 
design document here on the mailing list (similar to those here 
https://github.com/apache/aurora/blob/master/docs/design-documents.md, but 
probably shorter). 

This will allow us to split the discussion of the design from discussing the 
actual implementation. I believe this is necessary, as the DiscoveryInfo 
protocol is quite flexible 
(http://mesos.apache.org/documentation/latest/app-framework-development-guide/).

Thanks,
Stephan



From: Zhitao Li 
Sent: Monday, March 7, 2016 00:05
To: dev@aurora.apache.org
Subject: Populate DiscoveryInfo in Mesos

Hi,

It seems like Aurora does not populate the "discovery" field in either
TaskInfo or ExecutorInfo in mesos.proto.

I'm considering adding this to support retrieving port map in Mesos
directly. This would enable us to discover this information directly from
the Mesos side, and also enable us to build one universal service discovery
solution for multiple frameworks including Aurora.
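
For concreteness, here is a rough sketch of what a populated discovery field might carry for an Aurora task. The field names mirror the DiscoveryInfo message in mesos.proto; the "job.env.role" naming scheme and the visibility choice are illustrative assumptions, not settled design:

```python
def discovery_info(role, env, job, ports):
    # Shape mirrors mesos.proto's DiscoveryInfo: visibility, name, and a ports list.
    # The naming convention below is hypothetical.
    return {
        "visibility": "CLUSTER",
        "name": "{}.{}.{}".format(job, env, role),
        "ports": [{"name": n, "number": p, "protocol": "tcp"}
                  for n, p in sorted(ports.items())],
    }

print(discovery_info("vagrant", "test", "http_example", {"http": 31500}))
```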

If no objection, I'll create a JIRA ticket for this task.

Thanks.
--
Cheers,

Zhitao Li

Re: [DRAFT] [REPORT] Apache Aurora

2016-03-01 Thread Erb, Stephan
+1

From: Dave Lester 
Sent: Tuesday, March 1, 2016 07:15
To: dev@aurora.apache.org
Subject: Re: [DRAFT] [REPORT] Apache Aurora

+1

On Mon, Feb 29, 2016, at 05:27 PM, Jake Farrell wrote:
> +1 looks good
>
> -Jake
>
> On Mon, Feb 29, 2016 at 8:15 PM, Bill Farner  wrote:
>
> > Please take a moment to read through a draft of the board report i have
> > prepared for Aurora.  I will plan to submit mid-week, so please let me know
> > if you would like to see any edits/additions!
> >
> > ## Description:
> >   Apache Aurora lets you use an Apache Mesos cluster as a private cloud. It
> >   supports running long-running services, cron jobs, and ad-hoc jobs.
> >
> > ## Issues:
> >  There are no issues requiring board attention at this time.
> >
> > ## Activity:
> >  - Significant number of contributors in recent 0.11.0 and 0.12.0 releases.
> >  - Positive feedback from prebuilt binary packages (including nightlies),
> >signs of many users taking advantage of this to build their own.
> >
> > ## Health report:
> >  Aurora has been under active development for this period, with strong
> >  community engagement.  We have placed deliberate effort to build release
> >  notes as we iterate towards future releases, which has received positive
> >  feedback from users.
> >
> > ## PMC changes:
> >  - Currently 17 PMC members.
> >  - New PMC members:
> > - Joshua Cohen was added to the PMC on Tue Dec 22 2015
> > - John Sirois was added to the PMC on Sun Jan 03 2016
> > - Stephan Erb was added to the PMC on Wed Feb 03 2016
> > - Steve Niemitz was added to the PMC on Mon Jan 11 2016
> >
> > ## Committer base changes:
> >  - Currently 18 committers.
> >  - New committers:
> > - John Sirois was added as a committer on Mon Jan 04 2016
> > - Stephan Erb was added as a committer on Wed Feb 03 2016
> > - Steve Niemitz was added as a committer on Tue Jan 12 2016
> >
> > ## Releases:
> >  - 0.11.0 was released on Wed Dec 23 2015
> >  - 0.12.0 was released on Sun Feb 07 2016
> >
> > ## JIRA activity:
> >  - 79 JIRA tickets created in the last 3 months
> >  - 185 JIRA tickets closed/resolved in the last 3 months
> >


Weekly community meeting

2016-02-28 Thread Erb, Stephan
Hi everyone,

seems like we have been sloppy with the community meeting in the last weeks. It 
doesn't feel right to have a regular meeting that is skipped silently.

Any thoughts or ideas what we could do about that? 

Best Regards,
Stephan

Re: [RESULT][VOTE] Release Apache Aurora 0.12.0 RC4

2016-02-28 Thread Erb, Stephan
Even though we have done the voting, the release is still pending. We still 
have to build the packages and update the website.

Is there a way we can help out here?

Best,
Stephan

From: John Sirois 
Sent: Monday, February 8, 2016 23:47
To: dev@aurora.apache.org
Subject: [RESULT][VOTE] Release Apache Aurora 0.12.0 RC4

All,
The vote to accept Apache Aurora 0.12.0 RC4
as the official Apache Aurora 0.12.0 release has passed.


+1 (Binding)
--
jsirois
wfarner
serb
maxim


+1 (Non-binding)
--


There were no 0 or -1 votes. Thank you to all who helped make this release.


Aurora 0.12.0 includes the following:
---
The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=rel/0.12.0

The tag used to create the release with is rel/0.12.0:
https://git-wip-us.apache.org/repos/asf?p=aurora.git=rel/0.12.0

The release is available at:
https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.tar.gz

The MD5 checksum of the release can be found at:
https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.tar.gz.md5

The signature of the release can be found at:
https://dist.apache.org/repos/dist/release/aurora/0.12.0/apache-aurora-0.12.0.asc

The GPG key used to sign the release is available at:
https://dist.apache.org/repos/dist/release/aurora/KEYS


On Fri, Feb 5, 2016 at 3:14 PM, John Sirois  wrote:

> All,
>
> I propose that we accept the following release candidate as the official
> Apache Aurora 0.12.0 release.
>
> Aurora 0.12.0-rc4 includes the following:
> ---
> The NEWS for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=NEWS=rel/0.12.0-rc4
>
> The CHANGELOG for the release is available at:
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=rel/0.12.0-rc4
>
> The tag used to create the release candidate is:
>
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/tags/rel/0.12.0-rc4
>
> The release candidate is available at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz
>
> The MD5 checksum of the release candidate can be found at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz.md5
>
> The signature of the release candidate can be found at:
>
> https://dist.apache.org/repos/dist/dev/aurora/0.12.0-rc4/apache-aurora-0.12.0-rc4.tar.gz.asc
>
> The GPG key used to sign the release is available at:
> https://dist.apache.org/repos/dist/dev/aurora/KEYS
>
> Please download, verify, and test.
>
> The vote will close on Mon Feb  8 15:12:45 MST 2016
>
> [ ] +1 Release this as Apache Aurora 0.12.0
> [ ] +0
> [ ] -1 Do not release this as Apache Aurora 0.12.0 because...
> ---
>
> Reminder: you can verify the release candidate via:
>
>   ./build-support/release/verify-release-candidate 0.12.0-rc4
>
> If you can deploy the RC to a test cluster and evaluate it there, even
> better.
>


Re: [PROPOSAL] Disallow instance removal in job update

2016-02-07 Thread Erb, Stephan
A related idea that recently crossed my mind was some kind of pystachio 
variable / binding helper:  {{aurora.instances}}.

When updating a job, the scheduler would fill in the current instance count. 
However, when I want to change the number of instances, I could simply bind 
another value locally when triggering the update.
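
A toy sketch of that binding behavior ({{aurora.instances}} is the proposed, not-yet-existing helper, and the bind function below is a stand-in for pystachio's templating, not its real API):

```python
import re

def bind(template, bindings):
    # Minimal mustache-style substitution standing in for pystachio binding.
    return re.sub(r"\{\{([\w.]+)\}\}",
                  lambda m: str(bindings[m.group(1)]), template)

# The scheduler would fill in the current count by default...
print(bind("instances = {{aurora.instances}}", {"aurora.instances": 4}))
# ...while a user could bind another value locally to scale during an update.
print(bind("instances = {{aurora.instances}}", {"aurora.instances": 6}))
```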

From: Maxim Khutornenko 
Sent: Saturday, February 6, 2016 00:07
To: dev@aurora.apache.org
Subject: Re: [PROPOSAL] Disallow instance removal in job update

We have had attempts to safeguard client updater command with a
"dangerous change" warning before but it did not get good feedback.
Besides, automated tools/scripts just ignored it.

An alternative could be what George suggested on the scaling API thread
mentioned earlier: automatically bump up the instance count to the job's
active task count. I'd say this could be an implementation of the
proposal above rather than a safeguard, as it accomplishes the exact
same goal.
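
That behavior can be sketched in a few lines (illustrative only; effective_instance_count is a hypothetical helper, not scheduler code):

```python
def effective_instance_count(config_count, active_count):
    # Never let a stale .aurora instance count implicitly kill instances:
    # an update proceeds with at least the currently active count, and
    # shrinking remains an explicit killTasks operation.
    return max(config_count, active_count)

# A config that fell behind an autoscaler (config says 3, cluster runs 8)
# no longer scales the job in as a side effect of an update.
print(effective_instance_count(3, 8))  # 8
```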

Bill, do you have any ideas of what that safeguard could be?

On Fri, Feb 5, 2016 at 2:56 PM, Bill Farner  wrote:
>>
>> the outdated instance count problem will only get worse as automated
>> scaling tools will quickly render existing .aurora config value obsolete
>
>
> This is not a compelling reason to remove functionality.  Sounds like a
> safeguard is needed instead.
>
> On Fri, Feb 5, 2016 at 2:43 PM, Maxim Khutornenko  wrote:
>
>> This is mostly a survey rather than a proposal. How would people think
>> about limiting updater to only adding/updating instances and let
>> killTasks take care of instance removals?
>>
>> We have all heard stories (or happen to create some ourselves) when an
>> outdated instance count value in .aurora config caused unexpected
>> instance removals. Granted, there are plenty of other values in the
>> config that can cause service-wide outage but instance count seems to
>> be the worst in that sense.
>>
>> After the recent refactoring of addInstances and killTasks to act as
>> scaleOut/scaleIn APIs [1], the outdated instance count problem will
>> only get worse as automated scaling tools will quickly render existing
>> .aurora config value obsolete. With that in mind, should we block
>> instance removal in the updater and let an explicit killTasks call be
>> the only acceptable action to reduce instance count? Is there any
>> value (aside from arguable convenience factor) in having
>> startJobUpdate ever killing instances?
>>
>> Thanks,
>> Maxim
>>
>> [1] - http://markmail.org/message/2smaej5n5e54li3g
>>


Re: Further thoughts on config deprecations

2016-02-03 Thread Erb, Stephan
Are you suggesting a tool that operates against a running Aurora cluster and 
performs a serverside inspection? Or are you implying a tool that works on 
.aurora files? 

I'd find the first one way more useful, as the latter one would suggest that 
you had to have a monorepo with access to all the Aurora configurations.

Cheers,
Stephan

From: John Sirois 
Sent: Thursday, February 4, 2016 2:42 AM
To: dev@aurora.apache.org
Subject: Re: Further thoughts on config deprecations

On Wed, Feb 3, 2016 at 6:34 PM, Joshua Cohen  wrote:

> How would folks feel about requiring any changes that deprecate job config
> to include some sort of codemod[1]-like patch that would allow cluster
> operators to automatically fix deprecated fields across their company's
> Aurora configs?
>
> We could either leverage codemod directly, or create nicer tooling around
> it that supports applying arbitrary patches (along the lines of
> jscodeshift[2]).
>

I'd be really happy with a less ambitious step - a tool folks could run
that gave them warnings about things that were broken between their current
release and the targeted release - leaving fixes up to them, though
suggesting those fixes in console text - possibly with links out to longer
articles on the web for tougher fixes.
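
A minimal sketch of such a warning tool, under the assumption that deprecations live in a lookup table (the field name and advice below are hypothetical examples, not Aurora's actual deprecation list):

```python
import re

# Hypothetical deprecation table; real entries would be derived from release notes.
DEPRECATED = {
    "health_check_interval_secs": "replaced by HealthCheckConfig",
}

def scan_config(text):
    # Report (field, advice) for each deprecated field that appears in a config.
    return [(field, advice)
            for field, advice in DEPRECATED.items()
            if re.search(r"\b{}\b".format(field), text)]

print(scan_config("Job(health_check_interval_secs = 10)"))
```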


>
> Cheers,
>
> Joshua
>
> [1] https://github.com/facebook/codemod
> [2] https://github.com/facebook/jscodeshift
>



--
John Sirois
303-512-3301


Re: NEWS Layout

2016-02-02 Thread Erb, Stephan
Thanks for the quick responses! Now on master: 
https://reviews.apache.org/r/43109/ 

From: Zameer Manji <zma...@apache.org>
Sent: Tuesday, February 2, 2016 8:35 PM
To: dev@aurora.apache.org
Subject: Re: NEWS Layout

+1

On Tue, Feb 2, 2016 at 11:32 AM, John Sirois <j...@conductant.com> wrote:

> +1
>
> On Tue, Feb 2, 2016 at 12:29 PM, Joshua Cohen <jco...@apache.org> wrote:
>
> > +1
> >
> > On Tue, Feb 2, 2016 at 1:09 PM, Bill Farner <wfar...@apache.org> wrote:
> >
> > > +1
> > >
> > > On Tuesday, February 2, 2016, Erb, Stephan <
> stephan@blue-yonder.com>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I'd like to propose that we give our NEWS file a little bit more
> > > > structure. Currently, it is quite cluttered [1].
> > > >
> > > > To keep it simple, I'd suggest that we adopt the style from the 0.11
> > > > Aurora blog post:
> > > >
> > > > * New/updated
> > > > * Deprecations and removals
> > > >
> > > > [1] https://github.com/apache/aurora/blob/master/NEWS
> > > > [2] https://aurora.apache.org/blog/aurora-0-11-0-released/
> > > >
> > > >
> > > > Thoughts?
> > > >
> > >
> >
>
>
>
> --
> John Sirois
> 303-512-3301
>
> --
> Zameer Manji
>
> <303-512-3301>

Job-Aggregation

2016-01-26 Thread Erb, Stephan
Hi,

I've noticed that a couple of people [1, 2] have independently talked about 
aggregating multiple Aurora jobs to  'logical services'. Internally, we also do 
something similar. I am wondering if there is a broader concept waiting to be 
discovered as an Aurora feature. As a kind of related concept, Marathon does 
have its application groups [3] and Kubernetes its pods [4].

Does anyone have thoughts on this? That's a pretty open-ended question, but 
it would be unfortunate if many Aurora users end up building a similar thing 
independently. 

Regards,
Stephan


[1] 
https://mail-archives.apache.org/mod_mbox/aurora-user/201512.mbox/%3CCABwOPbetX_pA%3Dq8fUCWHB%2B9R1JdD8V4m_S6T2AVDvmwn36t4Cg%40mail.gmail.com%3E
[2] 
https://issues.apache.org/jira/browse/AURORA-1052?focusedCommentId=15108936=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15108936
[3] https://mesosphere.github.io/marathon/docs/application-groups.html
[4] http://kubernetes.io/v1.0/docs/user-guide/pods.html

Re: [PROPOSAL] Revisit task ID format

2016-01-26 Thread Erb, Stephan
Additional thought: I guess if we use the unmangled jobkey as a task name [1] 
we could totally go for the pure UUID task ids.

[1] 
https://github.com/apache/aurora/blob/749f83502f059ae6d2b229cf76c1ed44ccf3d255/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L135



From: Erb, Stephan <stephan@blue-yonder.com>
Sent: Wednesday, January 27, 2016 12:17 AM
To: dev@aurora.apache.org
Subject: Re: [PROPOSAL] Revisit task ID format

+1 for dropping the timestamp

However, I am not sure regarding the mangled jobkey. It tends to make it easier 
to correlate Mesos tasks to Aurora jobs when skimming log files, viewing the 
Mesos UI, or even when using the Thermos UI [1]. I guess the traceability of all 
of those use cases could be improved, but that would probably require additional 
work.

[1] https://github.com/apache/aurora/blob/master/docs/images/runningtask.png

From: Bill Farner <wfar...@apache.org>
Sent: Wednesday, January 27, 2016 12:03 AM
To: dev@aurora.apache.org
Subject: [PROPOSAL] Revisit task ID format

Context: a task ID is a unique identifier for a task.  Aurora and Mesos
both require this uniqueness.  Within mesos, frameworks are required to
craft their own task IDs as they see fit.

Our task ID format is currently [1]

TIMESTAMP-ROLE-ENV-JOBNAME-INSTANCE-UUID


for an example:

1453847837931-vagrant-test-http_example_docker-1-a23f55e2-151f-409e-9cea-76ec79482533


In addition to guaranteed uniqueness, this format has the benefit of being
somewhat human-friendly.  Within logs, it is somewhat possible to partially
recognize a task based solely on the text ID.

*I would like to propose that we remove the TIMESTAMP- prefix from the task
ID.*  It was originally included so that task IDs would be temporally
sortable for scheduling prioritization.  At present, tasks are not sorted
using the ID.

While proposing the above, i think it's also prudent to take the
opportunity to consider a complete overhaul of the ID contents.  *An
alternative approach would be to only use the UUID.*  This has the benefit
of decoupling arbitrary user input from the various ways task IDs are used
(as an example - mesos uses them in file names, limiting length and allowed
characters).  Task IDs also become fixed width, which offers a (very)
marginal memory reduction over the status quo, and makes console line
wrapping more consistent when perusing logs.  Additionally, it eschews the
potential problem of users parsing task IDs and coupling to its format.
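
The two options, sketched side by side (illustrative only; this is not Aurora's actual TaskIdGenerator):

```python
import uuid

def current_style_task_id(role, env, job, instance, timestamp_ms):
    # Existing layout: TIMESTAMP-ROLE-ENV-JOBNAME-INSTANCE-UUID
    return "-".join([str(timestamp_ms), role, env, job,
                     str(instance), str(uuid.uuid4())])

def proposed_task_id():
    # UUID-only: fixed 36-character width, no user-controlled input embedded.
    return str(uuid.uuid4())

print(current_style_task_id("vagrant", "test", "http_example_docker", 1, 1453847837931))
print(proposed_task_id())
```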

Any thoughts on this?


[1]
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/TaskIdGenerator.java

Re: Job-Aggregation

2016-01-26 Thread Erb, Stephan
I guess most of the time it is just the viewing, and then knowing that 
everything that is supposed to be there is healthy in the Aurora-sense. That's 
a pretty loose notion of working, but tends to be quite helpful.

From: Bill Farner <wfar...@apache.org>
Sent: Wednesday, January 27, 2016 12:53 AM
To: dev@aurora.apache.org
Subject: Re: Job-Aggregation

Are there any specific things you be looking to do with these groups, or
just view them as a logical collection?

On Tue, Jan 26, 2016 at 3:01 PM, Erb, Stephan <stephan@blue-yonder.com>
wrote:

> Hi,
>
> I've noticed that a couple of people [1, 2] have independently talked
> about aggregating multiple Aurora jobs to  'logical services'. Internally,
> we also do something similar. I am wondering if there is a broader concept
> waiting to be discovered as an Aurora feature. As a kind of related
> concept, Marathon does have its application groups [3] and Kubernetes its
> pods [4].
>
> Does anyone have thoughts on this?   That's a pretty open ended question
> here, but would be unfortunate if many Aurora users end up building a
> similar thing independently.
>
> Regards,
> Stephan
>
>
> [1]
> https://mail-archives.apache.org/mod_mbox/aurora-user/201512.mbox/%3CCABwOPbetX_pA%3Dq8fUCWHB%2B9R1JdD8V4m_S6T2AVDvmwn36t4Cg%40mail.gmail.com%3E
> [2]
> https://issues.apache.org/jira/browse/AURORA-1052?focusedCommentId=15108936=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15108936
> [3] https://mesosphere.github.io/marathon/docs/application-groups.html
> [4] http://kubernetes.io/v1.0/docs/user-guide/pods.html


Re: [PROPOSAL] Amend 0.12.0 release goals

2016-01-14 Thread Erb, Stephan
+1 for catching up

From: John Sirois 
Sent: Thursday, January 14, 2016 4:18 PM
To: dev@aurora.apache.org
Subject: Re: [PROPOSAL] Amend 0.12.0 release goals

On Thu, Jan 14, 2016 at 8:02 AM, Bill Farner  wrote:

> Given that we are still playing catch-up to mesos releases (we are on
> 0.25.0, latest is 0.26.0, there's talk of cutting 0.27.0 soon), i would
> like to suggest that we remove these tickets from 0.12.0:
>
> https://issues.apache.org/jira/browse/AURORA-987
> https://issues.apache.org/jira/browse/AURORA-1150
>
> It's slightly unfortunate as these tickets represent the largest planned
> effort for 0.12.0, we had a handful of significant contributions that still
> make this a featureful release:
>
>
> https://github.com/apache/aurora/blob/d542bd1d58bc5dcf6ead95d902c0a8cecbbffe9e/NEWS#L3-L23
>
> Additionally, i would like to recommend that we add the following ticket to
> 0.12.0 since we have an in-flight patch that looks close to completion:
>
> https://issues.apache.org/jira/browse/AURORA-1109


+1 to the proposal, catching up would be good to do.

I clarified the status of AURORA-987 by stopping progress and
un-assigning.  Although I'm working towards that ticket, its by very
indirect means still and I'll re-assign and re-start once I'm actually
engaged in the meat of the API proposal that is needed assuming no one else
has dived in.


>
>
>
> Cheers,
>
> Bill
>



--
John Sirois
303-512-3301


Re: AURORA-1440 Evaluate Fenzo scheduling library

2015-12-29 Thread Erb, Stephan
Someone also expressed interest in Fenzo by adding it to the community-driven 
roadmap [1].

AFAIK nobody has looked at it in detail yet. Or at least nobody has posted 
about it on the mailing list. Feel free to be that someone and take a closer 
look at what would be necessary to leverage the power of Fenzo in Aurora :-)

Without checking, I would assume that the following areas might need the most 
effort to address:
* the preemption mechanism of Aurora to make room for priority/production jobs
* the work-in-progress feature using oversubscribed resources [2]

Regards,
Stephan

[1] 
https://docs.google.com/document/d/1vyhTZSlEPeibQm2_7HK6JXOkydO0ZllZNQZ2O3cC4_0/edit?usp=sharing
 
[2] 
https://docs.google.com/document/d/1r1WCHgmPJp5wbrqSZLsgtxPNj3sULfHrSFmxp2GyPTo/edit?pref=2=1#heading=h.af56b6bntcao


From: Mauricio Garavaglia 
Sent: Tuesday, December 29, 2015 6:02 AM
To: dev@aurora.apache.org
Subject: AURORA-1440 Evaluate Fenzo scheduling library

Hello,

I found the issue AURORA-1440 about evaluating Fenzo
to replace the scheduling algorithm. Has there been any progress on it? Is
the intention to just replace the implementation, or also to expose part of the
configurations that Fenzo allows?

It would be handy to be able to select, for example, between cpu bin
packing and memory bin packing. But also take advantage of the rich
scheduling constraints api that it has. I assume a lot of discussion would
be needed regarding how to expose them though :)
Thanks


Mauricio

Re: Ticket cleanup

2015-12-28 Thread Erb, Stephan
+1. Having a well-groomed bug tracker is very helpful for everyone involved.

In particular, it would be great if we could get the bug count to 0 over the 
course of the next months. Either bugs are important and we get them fixed, or 
we have the guts to close them as won't-fix and update the documentation 
accordingly.

Best Regards,
Stephan



From: Bill Farner 
Sent: Monday, December 28, 2015 4:44 PM
To: dev@aurora.apache.org
Subject: Ticket cleanup

I'd like to share some rationale on my JIRA activity last night.  I'm happy
to undo any of the changes if folks disagree.

We had approximately 450 open tickets prior to last night.  Personally, I
found it daunting to find things we should actually work on.  To remedy
this, I started by skimming through tickets that have not been touched in
the last 6 months.  This quickly identified a swath of tickets that may be
valid, but I did not imagine them being valuable enough to address in the
foreseeable future.

If there is interest in doing more of this, I welcome help from others to
continue reducing the queue to something more manageable.  My culling
reduced the queue but it is still very large.

If you do participate in this, please try to avoid using "Resolved, Fixed"
so as to not add noise to the changelog in the next release.

Re: [VOTE] Release Apache Aurora 0.11.0 RC1

2015-12-20 Thread Erb, Stephan
What's up with this ticket here: 
https://issues.apache.org/jira/browse/AURORA-1520 

Was this forgotten? Should we do it now?

Regards,
Stephan

From: John Sirois 
Sent: Friday, December 18, 2015 3:37 AM
To: dev@aurora.apache.org
Subject: Re: [VOTE] Release Apache Aurora 0.11.0 RC1

On Thu, Dec 17, 2015 at 5:17 PM, Bill Farner  wrote:

> Friendly reminder that verifying a release can be as easy as
>
>   ./build-support/release/verify-release-candidate  0.11.0-rc1
>
> Of course, if you have a simulated production environment, we would love to
> hear how this build behaves there!
>
> On Thu, Dec 17, 2015 at 4:08 PM, Bill Farner  wrote:
>
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Aurora 0.11.0 release.
>

+1 non-binding

Verified on Arch Linux - kernel 4.2.5 + OpenJDK 1.8.0_66 + Python 2.7.11

>
> > Aurora 0.11.0-rc1 includes the following:
> > ---
> > The NEWS for this release is available at:
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=NEWS=0.11.0-rc1
> >
> > The CHANGELOG for the release is available at:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git=CHANGELOG=0.11.0-rc1
> >
> > The branch used to create the release candidate is:
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=shortlog;h=refs/heads/0.11.0-rc1
> >
> > The release candidate is available at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.11.0-rc1/apache-aurora-0.11.0-rc1.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.11.0-rc1/apache-aurora-0.11.0-rc1.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> >
> >
> https://dist.apache.org/repos/dist/dev/aurora/0.11.0-rc1/apache-aurora-0.11.0-rc1.tar.gz.asc
> >
> > The GPG key used to sign the release is available at:
> > https://dist.apache.org/repos/dist/dev/aurora/KEYS
> >
> > Please download, verify, and test.
> >
> > The vote will close on Tue Dec 22 15:18:55 PST 2015
> >
> > [ ] +1 Release this as Apache Aurora 0.11.0
> > [ ] +0
> > [ ] -1 Do not release this as Apache Aurora 0.11.0 because...
> >
>



--
John Sirois
303-512-3301

Re: Website fixes

2015-12-13 Thread Erb, Stephan
Looks great, thanks for that!

Only issue I have come across: images on the presentation list are broken 
https://aurora.apache.org/documentation/latest/presentations/ 

Regards,
Stephan

From: Bill Farner 
Sent: Saturday, December 12, 2015 11:20 PM
To: dev@aurora.apache.org
Subject: Website fixes

Just an FYI that i spent some time this weekend cleaning up a few rough
edges on the website.

Noteworthy changes (may be difficult to notice without the original for
comparison):
- documentation : fixed
the majority of broken links and anchor tags.  trying to maintain perfect
parity with GitHub's markdown is still an uphill battle, but it's much
improved.
- downloads : binary distributions
are now linked, added links to release notes (blog posts)
- blog : tweaked the style and enabled
summaries to make the page less sparse

Also, for committers - hopefully building the website is now less brittle.
I've added a Vagrantfile that manages all the system dependencies for the
rake/middleman setup.  Feedback is welcome!

Re: Questions about Aurora scheduling policy

2015-11-25 Thread Erb, Stephan
Hi Riccardo,

have you looked into the quota feature and the updater settings? I have the 
impression that a combination of both is what you are looking for.

Regarding your requirement to have no resource constraints, have you considered 
starting a certain set of Mesos slaves defaulting to the posix isolators for 
CPU and RAM? Those posix isolators don't constrain your task at all, i.e., you 
can always use as much memory as you want, up to what is physically available 
on the host.

Regards,
Stephan

From: Riccardo Poggi 
Sent: Tuesday, November 24, 2015 9:14 PM
To: dev@aurora.apache.org
Subject: Re: Questions about Aurora scheduling policy

Thanks Bill,

> I will answer your questions directly, but it may
> also make sense to have a higher-level discussion about your requirements
> to possibly offer alternative approaches.
Sure, that sounds very interesting.

The main scheduling constraint in our system is that it cannot tolerate
an undefined pending period; it needs at least some fast feedback.

Example scenario: the system asks to launch N instances of a given
process with a specified set of resource constraints, now the scheduler
should either
- start them all successfully
- start only a part of them
 + because of resource saturation for example
- fail to start them
 + because those resources are not at all available in the farm, or
they are effectively busy, or ...
without queuing, but just returning the operation result.


There are also other, more functional, requirements. The most
interesting ones probably are:
* Hooks. For good integration with the infrastructure, like publishing
of information, access management control, and so on...
* Fault tolerance and error recovery. A failure in either the scheduler
or executor shall not take down the managed processes, and they shall be
recovered after a restart, for example.
* Ownership. Possibility to set uid owner of the underlying processes.
* Notification system. Provide information about job/process
status (in both pull and push modes)


> 3. Would it be possible to have Aurora handle tasks with no resource
>> constraints?
> [...] but the real question is:
> what behavior do you want from scheduling without accounting?
Yes, that is indeed the question. What would be the default scheduling
behaviour in case of resource abundance?
Would it be possible to impose a "spreading" constraint on a subset of
the farm that I know a priori is able to handle a defined set of jobs?


Cheers,
Riccardo


On 11/24/2015 06:38 AM, Bill Farner wrote:
> Welcome, happy to help!  I will answer your questions directly, but it may
> also make sense to have a higher-level discussion about your requirements
> to possibly offer alternative approaches.
>
> 1. Would it be possible to subscribe to state changes for a given job/task and
>> receive notifications?
>
> Not today, but i'm very open to the idea.  I think there are some cool
> things you could implement with this behavior.  A concern i often have,
> though, is that consumers of this data cannot handle cases where an event
> fails to be delivered (i.e. they want a replica of the scheduler's state).
> At any rate, i'd love to offer this behavior!
>
> 2. Would it be possible to set a Pending time-out for tasks that take too long
>> to be Assigned?
>
> Not currently.  You could implement this by polling the API and killing
> tasks that took too long to schedule.  This would allow you to decide how
> to react (if at all).
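To make the polling idea concrete, here is a minimal sketch of the core check. The map of pending tasks and the kill call are hypothetical wrappers around the Aurora API, not actual client methods:

```python
import time

def find_stuck_pending(pending_since_by_task, pending_timeout_secs, now=None):
    """Return the ids of tasks that have been PENDING longer than the timeout.

    pending_since_by_task maps task_id -> epoch seconds at which the task
    entered PENDING; building that map from the Aurora API is left to a
    (hypothetical) client wrapper.
    """
    now = time.time() if now is None else now
    return [task_id for task_id, since in pending_since_by_task.items()
            if now - since > pending_timeout_secs]

# A monitor would call this every poll interval and decide how to react:
#   for task_id in find_stuck_pending(snapshot, 300):
#       client.kill_task(task_id)   # hypothetical wrapper around killTasks
```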
>
> 3. Would it be possible to have Aurora handle tasks with no resource
>> constraints?
>
> No.  Both Aurora and Mesos require CPU and memory to be specified for
> tasks.  A client of Aurora could choose defaults, but the real question is:
> what behavior do you want from scheduling without accounting?
>
>
> On Mon, Nov 23, 2015 at 5:34 PM, Riccardo Poggi 
> wrote:
>
>> Hello,
>>
>> In order to introduce Aurora into our distributed system we would like to
>> have it slowly, and hopefully transparently, replace what is currently the
>> process manager component. To do that it would have to programmatically
>> interface with other parts of the system that are, at the moment, taking
>> care of what could be considered the "active" orchestration.
>>
>> I've tried Aurora and looked at the docs, but I'm still left with some
>> open questions:
>>
>> 1. Would it be possible to subscribe to state changes for a given job/task
>> and receive notifications?
>>
>> 2. Would it be possible to set a Pending time-out for tasks that take too
>> long to be Assigned?
>>
>> 3. Would it be possible to have Aurora handle tasks with no resource
>> constraints?
>>
>>
>> Thanks,
>>Riccardo
>>



Aurora master seems broken

2015-10-07 Thread Erb, Stephan
Hi,

does anyone (who is not on the move today) have time to look into this bug? 
If Aurora is broken during MesosCon it will be quite a bit more difficult for 
new people to play around with it.

https://issues.apache.org/jira/browse/AURORA-1513

Thanks and Best Regards,
Stephan

Meaning and usage of job environments

2015-09-02 Thread Erb, Stephan
Hi everyone,

I have been wondering about the environment part of Aurora jobkeys (devel, 
test, staging, staging1, ...prod). 

I guess you made the environment a first class citizen in order to enforce some 
kind of standardization. How is this working out for you in practice? 

IIRC, the current set of supported environment names is enforced by the client. 
In general, would you be opposed to moving this validation to the scheduler, so 
that it can be configured cluster wide? This would enable us to introduce 
environment names which are closer to our domain and organization.  Or is this 
even somewhat covered in the upcoming job tier implementation?

Thanks for your input,
Stephan

SSL Support

2015-08-13 Thread Erb, Stephan
Hi,

let's say I want to run Aurora using SSL. How would I do that? Hide it behind a 
reverse proxy? 

At least the client seems to have some https-related bits and pieces 
[1].

Best Regards,
Stephan

[1] 
https://github.com/apache/aurora/blob/8bdfb8500e792da199bd8cc9fed38d36e2448e81/src/main/python/apache/aurora/client/api/scheduler_client.py#L160

Blue Yonder is using Aurora

2015-07-19 Thread Erb, Stephan
Hi everyone,

we are happy to announce that Blue Yonder (http://www.blue-yonder.com/en/) is 
using Apache Aurora.

We have published a short explanation of our current usecase [1]  and are 
looking forward to further collaboration with the community.

[1] 
http://www.blue-yonder.com/blog-e/2015/07/18/scalable-data-science-apache-aurora-at-blue-yonder/
 

Kind Regards,
Stephan

Re: Using Prometheus with Aurora

2015-07-01 Thread Erb, Stephan
Hi Brian,

that will come in handy. Thanks for that!

I have thought about using prometheus together with the Mesos and Aurora 
exporters. Your feature would even allow us to start both exports on Aurora and 
let Prometheus discover them automatically. However, it might be a bit risky to 
use Aurora/Mesos to host its own monitoring solution...

How have you organized the deployment of your Prometheus components? 

Regards,
Stephan

From: Brian Brazil brian.bra...@boxever.com
Sent: Wednesday, July 1, 2015 4:54 PM
To: dev@aurora.apache.org; prometheus-developers
Subject: Using Prometheus with Aurora

Hi,

I've added support to Prometheus to use Twitter serversets for service
discovery, and we've now got this deployed in production. A key point is
that you can configure it to look at a tree of ZNodes in Zookeeper, so that
it can automatically pick up any new jobs announced by the Aurora executor
with no additional configuration required.


The scrape config we have in ansible for this is:

  - job_name: aurora_job
    serverset_sd_configs:
      - servers:
          {% for host in groups[zookeeper] %}
          - {{ host }}:2181
          {% endfor %}
        paths:
          - /aurora/serversets
    relabel_configs:
      - source_labels: [ __meta_serverset_path ]
        regex: '^/aurora/serversets/[^/]+/[^/]+/([^/]+)/.*'
        target_label: job
        replacement: ${1}
      - source_labels: [ __address__ ]
        regex: '(ip-\d+-\d+-\d+-\d+)\..*:(\d+)'
        target_label: __address__
        replacement: ${1}:${2}



The relabel of the addresses is there because our mesos slaves have the
externally visible host names, so they work nicely in the browser, but those
names don't actually exist in DNS and so need some munging.
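The two regexes above can be sanity-checked with a quick script. The example serverset path and EC2-style address below are made up, assuming the standard /aurora/serversets/&lt;role&gt;/&lt;env&gt;/&lt;job&gt; layout:

```python
import re

job_re = re.compile(r'^/aurora/serversets/[^/]+/[^/]+/([^/]+)/.*')
addr_re = re.compile(r'(ip-\d+-\d+-\d+-\d+)\..*:(\d+)')

# Extract the job name from a (hypothetical) serverset ZNode path.
path = '/aurora/serversets/www-data/prod/hello/0'
job = job_re.match(path).group(1)

# Rewrite an EC2-style internal address to plain host:port, as the
# second relabel rule does.
address = 'ip-10-0-1-2.ec2.internal:8081'
m = addr_re.match(address)
rewritten = '%s:%s' % (m.group(1), m.group(2))
```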

Note that you'll need the Prometheus at head to use this, and that this
should also work for non-Aurora created serversets such as if you're using
Finagle.

Relevant docs:
http://aurora.apache.org/documentation/latest/configuration-reference/#announcer-objects

http://prometheus.io/docs/operating/configuration/#zookeeper-serverset-sd-configurations-serverset_sd_config

Brian


Re: non-prod SLA stats

2015-06-16 Thread Erb, Stephan
Hi Maxim,

I have submitted a first patch closely following your initial proposal.  The 
patch needs another iteration or two, so please let me know what you think.

https://reviews.apache.org/r/35498/ 

Thanks,
Stephan


From: Maxim Khutornenko ma...@apache.org
Sent: Monday, June 1, 2015 6:30 PM
To: dev@aurora.apache.org
Subject: Re: non-prod SLA stats

Hi Stephan,

Thanks for your analysis. I must mention though that SLA algorithms
were optimized for readability rather than CPU performance.

Given the current minutely run cycle, I would not be concerned about
calculation delay unless it threatens to break the schedule. In a
large cluster with tens of thousands of SLA metrics (both prod and
non-prod) the average observed SLA run time is around 4 seconds, which
gives us plenty of headroom for growth.

I am more concerned about the heap space used to store computed
counters here. This may quickly become a bottleneck and as a side
effect make our /vars endpoint unusable. Hence, the suggestion to make
non-essential stats fully toggle-able.

That said, if you envision a different use case with a much larger
metric set or anticipate a more frequent run schedule - feel free to
propose patches. I'd also encourage you to invest some time into an SLA
benchmark using our JMH-based harness to back your changes with real
perf data.

Thanks,
Maxim

On Mon, Jun 1, 2015 at 3:26 AM, Erb, Stephan
stephan@blue-yonder.com wrote:
 Hi Maxim,

 introducing some toggles for metric collection should definitely work and can 
 be contributed via a simple pull request.

 However, if you are only concerned about a potential performance hit, we 
 might as well think about tuning the existing metric calculation. I have 
 skimmed the code, and there seem to be several more or less low-hanging 
 fruits:

 * The uptime computation performs the task enumeration and sorting operation 
 for every percentile, whereas this only needs to be done once.
 * The current approach used to compute a percentile takes O(n log n) time. 
 There are alternative solutions running in only O(n) time.
 * There are some unnecessary allocations, i.e., SlaUtil.percentile() is 
 always called on a temporary list. However, the first thing it does is to 
 create a copy of that list.
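To illustrate the O(n) alternative mentioned above: a nearest-rank percentile computed via randomized quickselect. Note that Aurora's SlaUtil uses its own percentile convention, so this is only a sketch of the selection idea, not a drop-in replacement:

```python
import math
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time."""
    xs = list(values)
    lo, hi = 0, len(xs) - 1
    while True:
        if lo == hi:
            return xs[lo]
        pivot = xs[random.randint(lo, hi)]
        # Three-way partition of the active window around the pivot.
        lt = [x for x in xs[lo:hi + 1] if x < pivot]
        eq = [x for x in xs[lo:hi + 1] if x == pivot]
        gt = [x for x in xs[lo:hi + 1] if x > pivot]
        xs[lo:hi + 1] = lt + eq + gt
        if k < lo + len(lt):
            hi = lo + len(lt) - 1
        elif k < lo + len(lt) + len(eq):
            return pivot
        else:
            lo = lo + len(lt) + len(eq)

def nearest_rank_percentile(values, p):
    """Nearest-rank percentile without a full O(n log n) sort."""
    k = max(0, math.ceil(p / 100.0 * len(values)) - 1)
    return quickselect(values, k)
```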

 How about that: I will file a ticket for non-prod SLA stats and contribute a 
 simple patch with toggles. If it turns out that these are unusable for 
 twitter-scale, we can look into basic performance tuning.

 Best Regards,
 Stephan
 
 From: Maxim Khutornenko ma...@apache.org
 Sent: Friday, May 29, 2015 7:23 PM
 To: dev@aurora.apache.org
 Subject: Re: non-prod SLA stats

 Hi Stephan,

 Tracking the same set of metrics for all non-prod jobs could be
 somewhat expensive on both collection and consumption sides. The only
 metrics we currently chose to collect are MTTA/R to help us monitor
 scheduling rate in view of reduced cluster capacity (AURORA-774).
 Perhaps we could put non-prod collection behind a set of command line
 switches (ArgBoolean)? E.g.:

 SLA_COLLECT_NON_PROD_MEDIANS
 SLA_COLLECT_NON_PROD_JOB_UPTIMES
 SLA_COLLECT_NON_PROD_PLATFORM_UPTIMES

 These could be defined in SlaModule and injected into MetricCalculator
 to let us finely tune the required non-prod collection set. What do
 you think?

 Thanks,
 Maxim

 On Fri, May 29, 2015 at 7:09 AM, Erb, Stephan
 stephan@blue-yonder.com wrote:
 Hi everyone,

 we are interested in the job uptime percentiles and the aggregate 
 cluster uptime percentage not only for production jobs, but also for our 
 non-production jobs.

 Are there any reasons why those are not available in a non-prod version, 
 similar to the current handling of mtta and mttr [1]?  If there are no 
 objections, I will prepare a patch.

 Regards,
 Stephan

 [1] 
 https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/MetricCalculator.java#L69

Re: Health Checks for Updates design review

2015-05-06 Thread Erb, Stephan
Hi Maxim,

I am not keen on the potential risk of tasks getting stuck in STARTING. We 
perform auto-scaling of jobs, so there might be nobody around to notice and 
correct the problem in time.

How about keeping initial_interval_secs and just changing its meaning to be a 
grace period, so that health checks are triggered but failures are ignored 
during this interval?

The initial_interval_secs is then a user-configurable upper bound on when a job 
is expected to be healthy. It can even be set rather high, because it won't 
affect the update performance.
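A sketch of those semantics, with a fake clock for illustration. The parameter names mirror the proposal, but the loop itself is hypothetical, not Aurora's actual health checker:

```python
import time

class FakeClock:
    """Deterministic stand-in for the time module, for the demo below."""
    def __init__(self):
        self.now = 0.0
    def time(self):
        return self.now
    def sleep(self, secs):
        self.now += secs

def await_healthy(check, initial_interval_secs, interval_secs,
                  max_consecutive_failures, clock=time):
    """Start health-checking immediately; ignore failures during the grace period.

    Returns True as soon as a check passes, False once failures after the
    grace period exceed max_consecutive_failures.
    """
    start = clock.time()
    failures = 0
    while True:
        if check():
            return True
        if clock.time() - start >= initial_interval_secs:
            failures += 1
            if failures > max_consecutive_failures:
                return False
        clock.sleep(interval_secs)
```

The key point is that failing checks inside the grace period never count against the task, so a high initial_interval_secs only delays failure detection, not success.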

What do you think?

Best Regards,
Stephan

From: Maxim Khutornenko ma...@apache.org
Sent: Tuesday, May 5, 2015 10:24 PM
To: dev@aurora.apache.org
Subject: Health Checks for Updates design review

Hi,

I have put together a design proposal for improving health-enabled job
update performance. Please, review and leave your comments:

https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit

Thanks,
Maxim

Re: Graceful task shutdown

2015-04-07 Thread Erb, Stephan
Brian, do you have any particular plans regarding your shutdown requirements? I 
have seen that you have filed another issue [1] which is also concerned with 
graceful shutdown.

Stephan

PS: For what it's worth, I implemented the 'quick fix' version to my problem 
stated in the beginning of this thread [2].

[1] https://issues.apache.org/jira/browse/AURORA-1257
[2] https://reviews.apache.org/r/32889/


From: Brian Brazil brian.bra...@boxever.com
Sent: Tuesday, March 24, 2015 10:48 PM
To: d...@aurora.incubator.apache.org
Subject: Re: Graceful task shutdown

On 24 March 2015 at 21:33, George Sirois geo...@tellapart.com wrote:

 Unfortunately I don't think my change will be able to make it in as-is.

 As Brian Wickman pointed out, it could introduce serious problems because
 there are varying timeouts across the scheduler/executor, so if you set
 your wait time to be too high, the scheduler might start to consider the
 tasks lost because they stayed in the transient KILLING state for too long.


Hmm, what sort of work is involved in resolving that?

In my case I need at least 12s after the /qqq before sending the TERM.

Brian



 I do think the lifecycle modules idea would solve Stephan's issue.

 On Tue, Mar 24, 2015 at 5:06 PM, Brian Brazil brian.bra...@boxever.com
 wrote:

  On 24 March 2015 at 20:57, Erb, Stephan stephan@blue-yonder.com
  wrote:
 
   Hi everyone,
  
   we are implementing the /health endpoint in our services but omit the
   implementation of the unauthenticated lifecycle methods /quitquitquit
 and
   /abortabortabort.
  
   As a consequence, stopping a service is taxed by 10 seconds waiting
 time
   [1]. I would like to get rid of this unnecessary delay and can think of
  two
   solutions:
  
   a) Only perform the escalation wait when the http_signaler reports that
   the message could be delivered to the service. This is a rather simple
  and
   localized fix.
  
   b) Use another port for lifecycle events. This would require a new
   addition to the task configuration and proper plumbing throughout the
  rest
   of the system. Backward compatibility could be achieved by using
 'health'
   as the default lifecycle management port.
  
   Any thoughts? I would be happy with the simple solution, but in the end
   it's your call :-)
  
 
  __george mentioned on IRC working on a change that'll let the wait time
 be
  configurable (which is something I also need), would that cover your use
  case?
 
  There were also discussions on IRC about custom lifecycle modules.
 
  Brian
 
 
  
   Best Regards,
   Stephan
  
   [1]
  
 
 https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/executor/thermos_task_runner.py#L123
 


Re: Pass slave-ip to user process

2015-04-07 Thread Erb, Stephan
Done :-)
https://issues.apache.org/jira/browse/AURORA-1261 


From: Zameer Manji zma...@apache.org
Sent: Tuesday, April 7, 2015 10:19 PM
To: dev@aurora.apache.org
Subject: Re: Pass slave-ip to user process

Variables like {{mesos.hostname}} and {{mesos.ip}} seem reasonable.
Please file a ticket requesting these.

On Tue, Apr 7, 2015 at 1:11 PM, Erb, Stephan stephan@blue-yonder.com
wrote:

 Hi,

 you are right, network interface handling should probably be done in
 Mesos. However, I thought Aurora could simply expose the information it
 already has about the slave, for example in form of new thermos variables
 such as {{mesos.hostname}} and {{mesos.ip}}.

 For now, we will stick to our workaround with a config file in /etc. It
 will do the trick for the foreseeable future.
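For reference, the fallback described above amounts to something like this. The /etc path is site-specific and made up here, and LIBPROCESS_IP only arrives if the executor ever passes it through:

```python
import os

def slave_bind_ip(config_path='/etc/mesos-slave-ip'):
    """Resolve the IP to bind services to.

    Prefer the LIBPROCESS_IP environment variable if present; otherwise
    fall back to a site-local config file (the path is illustrative).
    """
    ip = os.environ.get('LIBPROCESS_IP')
    if ip:
        return ip
    try:
        with open(config_path) as f:
            return f.read().strip()
    except (IOError, OSError):
        return None
```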

 Best Regards,
 Stephan
 
 From: Zameer Manji zma...@apache.org
 Sent: Saturday, April 4, 2015 12:10 AM
 To: dev@aurora.apache.org
 Subject: Re: Pass slave-ip to user process

 Hey,

 I don't think there is a way to solve this problem currently in Aurora. I
 think a problem of this nature would be better solved by the Mesos devs.
 Ideally through FS isolation or containerization method operators can
 select which network interface is available to user applications.

 I think the next step would be to file a MESOS ticket and see what the
 developers there think.

 On Mon, Mar 30, 2015 at 7:28 AM, Erb, Stephan stephan@blue-yonder.com
 
 wrote:

  Hi everyone,
 
  we are running our Mesos slaves on hosts with multiple network interfaces
  and would like to specifically bind started services to the ip used to
  start the Mesos slave (as specified via --ip).
 
  Mesos seems to export this ip via the LIBPROCESS_IP environment variable.
  However, thermos explicitly overwrites the environment [1].
 
  As a workaround, we are reading a global config in /etc, but with
  progressing disk isolation in Mesos this does not seem like a good long
  term solution.
 
  Any ideas how to properly solve this problem in Aurora?
 
  Regards,
  Stephan
 
  [1]
 
 https://github.com/apache/aurora/blob/master/src/main/python/apache/thermos/core/process.py#L341
 
  --
  Zameer Manji
 
 

 --
 Zameer Manji