Re: Volunteers needed

2018-09-18 Thread Bill Farner
I’m happy to pitch in for periodic review.  Anyone is welcome to email me
requesting a review.  I don’t monitor incoming reviews, so unfortunately I
will need to be contacted out-of-band.

On Tue, Sep 18, 2018 at 10:45 AM Renan DelValle 
wrote:

> All,
>
> We are in dire need of folks who would be willing to commit time to review
> patches and submit patches to maintain the project. Small things like
> submitting a patch to upgrade our Mesos dependency (or any other dependency
> really) go a long way towards keeping the project up to date.
>
> Unfortunately, many of the members of the Aurora Project Management
> Committee (PMC) have either moved on from the project or are not in a
> position to dedicate time to the project.
>
> I brought up the topic of PMC inactivity on September 5th to the PMC.
> Until today I've only heard from two other PMC members about this topic
> privately. That is the situation the project is currently in.
>
> This means there is a very real chance that if we don't get volunteers,
> the project will fall behind and, ultimately, become unmaintained.
>
> Therefore if you use this project and would like to see its development
> continue, please consider helping us maintain it by submitting patches or
> reviewing code.
>
> Thanks
>
> -Renan
>


Re: Detecting Flapping Tasks in Aurora

2018-02-01 Thread Bill Farner
You could scan for tasks that are in, or have been in, the THROTTLED
state.  You can adjust the time intervals for throttled tasks with these
scheduler args.
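
For illustration, here is a rough sketch of such a scan against the
scheduler's thrift API.  It assumes the generated classes from api.thrift
on the classpath and a scheduler at a made-up address; the /api path with
the JSON thrift protocol is the same channel the web UI uses.

```
import java.util.EnumSet;

import org.apache.aurora.gen.AuroraAdmin;
import org.apache.aurora.gen.Response;
import org.apache.aurora.gen.ScheduleStatus;
import org.apache.aurora.gen.TaskQuery;
import org.apache.thrift.protocol.TJSONProtocol;
import org.apache.thrift.transport.THttpClient;

public class ThrottledTaskScan {
  public static void main(String[] args) throws Exception {
    // Hypothetical scheduler address; /api serves thrift over HTTP.
    THttpClient transport =
        new THttpClient("http://scheduler.example.com:8081/api");
    AuroraAdmin.Client client =
        new AuroraAdmin.Client(new TJSONProtocol(transport));

    // Fetch every task currently in the THROTTLED state.
    TaskQuery query =
        new TaskQuery().setStatuses(EnumSet.of(ScheduleStatus.THROTTLED));
    Response response = client.getTasksStatus(query);

    response.getResult().getScheduleStatusResult().getTasks().forEach(task ->
        System.out.println(task.getAssignedTask().getTaskId()));
  }
}
```

To also catch tasks that have since left THROTTLED, the taskEvents history
on each returned ScheduledTask can be inspected the same way.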

> Also, do we have a telegraf plugin for Aurora?


Not that i'm aware of.  Let me know if you need any pointers with how stats
are exported from Aurora to do this.

On Thu, Feb 1, 2018 at 3:15 PM, De, Bipra  wrote:

> Hello Everyone,
>
>
>
> I am working on an alert system that will call Aurora APIs to detect jobs
> that have flapping tasks. It runs every hour.
>
>
>
> Any suggestions on how to detect such jobs that have tasks flapping,
> provided those tasks were submitted to aurora as part of the same request.
> This is to filter out the cases where a user tries to submit a job multiple
> times but each time it failed.
>
>
>
> Also, do we have a telegraf plugin for Aurora?
>
>
>
> Regards,
>
> Bipra.
>


Re: shutdown vs kill API in Mesos

2018-01-16 Thread Bill Farner
>
> We still need "Agent ID" for the shutdown call.


Darn.  In that case, how about we change the method signature in Driver to
accept agentId and ignore that param in MesosSchedulerDriver.

> But do we really need the command line option?


Aurora can run tasks without an executor.  I'm assuming the shutdown call
is incompatible with that mode.

On Tue, Jan 16, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> We still need "Agent ID" for the shutdown call.
>
> On Tue, Jan 16, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Sounds good. But do we really need the command line option? One can use
>> an older Driver if KILL is preferred for some reason.
>>
>> On Tue, Jan 16, 2018 at 1:51 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>>> This situation is much simpler if task ID == executor ID.  I can't come
>>> up with a good reason why this is not the case today.  Our executor IDs
>>> originally included a static prefix, though i do not recall any rationale for
>>> this.  When Renan added custom executor support, this static prefix was
>>> made configurable.  Again, i do not believe there was any rationale for the
>>> utility of executor IDs.
>>>
>>> I propose the following:
>>> - Change relevant code in MesosTaskFactory to
>>> setExecutorId(task.getTaskId())
>>> - Add a command line parameter (default false) to toggle use of executor
>>> shutdown in VersionedSchedulerDriverService.killTask
>>>
>>> Does anyone see an issue with this approach?
>>>
>>> On Tue, Jan 16, 2018 at 11:15 AM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> To do this in a backward compatible manner, one way is :
>>>>
>>>> ```
>>>> void destroy(taskId, executorId, agentId) {
>>>>
>>>> if (driver instanceof Versioned)
>>>>(Versioned...)driver.shutdown(executorId, agentId)
>>>> else
>>>>driver.kill(taskId)
>>>>
>>>> }
>>>> ```
>>>>
>>>> Any other opinions?
>>>>
>>>> On Tue, Jan 16, 2018 at 11:12 AM, David McLaughlin <
>>>> dmclaugh...@apache.org> wrote:
>>>>
>>>>> Nope, I support getting SHUTDOWN in for users of the new API.
>>>>>
>>>>> On Tue, Jan 16, 2018 at 11:06 AM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Are you suggesting that we delay the switch to SHUTDOWN call until
>>>>>> this working group can resolve the API perf issue?
>>>>>>
>>>>>> On Mon, Jan 15, 2018 at 3:55 PM, David McLaughlin <
>>>>>> dmclaugh...@apache.org> wrote:
>>>>>>
>>>>>>> We are working with Mesos folks to resolve it. There is a Mesos
>>>>>>> performance working group that folks can join if they'd like to 
>>>>>>> contribute:
>>>>>>> http://mesos.apache.org/blog/performance-working-group-progress-report/
>>>>>>>
>>>>>>> I'm not sure what you mean by branch. Everything we used to scale
>>>>>>> test is on master.
>>>>>>>
>>>>>>> On Mon, Jan 15, 2018 at 10:08 AM, Meghdoot bhattacharya <
>>>>>>> meghdoo...@yahoo.com> wrote:
>>>>>>>
>>>>>>>> David, should twitter try against mesos 1.5 to see if things are
>>>>>>>> better with the new api instead of libmesos. This is going to be a 
>>>>>>>> drift
>>>>>>>> over time that will stop us from adopting new features.
>>>>>>>>
>>>>>>>> If it was sometime back it would be good to rerun the tests and
>>>>>>>> open a ticket in Mesos if issues exist. All aurora users can then push 
>>>>>>>> for
>>>>>>>> resolution.
>>>>>>>>
>>>>>>>> Also details on branch etc that has the api integration?
>>>>>>>>
>>>>>>>> Thx
>>>>>>>>
>>>>>>>> On Jan 12, 2018, at 11:39 AM, David McLaughlin <
>>>>>>>> dmclaugh...@apache.org> wrote:
>>>>>>>>
>>>>>>>> I'm not sure I agree with the summary. Bill's proposal was using
>>>>>>>> shutdown only when using the new API. I would also support this if it's
>>>>>>>> possible.
>>>>>>>>
>>>>>>>> On Fri, Jan 12, 2018 at 11:14 AM, Mohit Jaggi <mohit.ja...@uber.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Summary so far:
>>>>>>>>> - Bill supports making this change
>>>>>>>>> - This change cannot be made in a backward compatible manner
>>>>>>>>> - David (Twitter) does not want to use HTTP APIs due to
>>>>>>>>> performance concerns. I conclude that folks from Twitter don't 
>>>>>>>>> support this
>>>>>>>>> change
>>>>>>>>>
>>>>>>>>> Question:
>>>>>>>>> - Are there other users that want this change?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: shutdown vs kill API in Mesos

2018-01-16 Thread Bill Farner
This situation is much simpler if task ID == executor ID.  I can't come up
with a good reason why this is not the case today.  Our executor IDs
originally included a static prefix, though i do not recall any rationale for
this.  When Renan added custom executor support, this static prefix was
made configurable.  Again, i do not believe there was any rationale for the
utility of executor IDs.

I propose the following:
- Change relevant code in MesosTaskFactory to setExecutorId(task.getTaskId())
- Add a command line parameter (default false) to toggle use of executor
shutdown in VersionedSchedulerDriverService.killTask

Does anyone see an issue with this approach?
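
To make the proposed toggle concrete, here is a rough sketch of the kill
path it implies.  Driver, VersionedDriver, agentIdFor(), and the flag name
are illustrative stand-ins, not existing Aurora code:

```
class KillRouter {
  // Stand-in interfaces; Aurora's real driver types differ.
  interface Driver { void killTask(String taskId); }
  interface VersionedDriver extends Driver {
    void shutdownExecutor(String executorId, String agentId);
  }

  private final boolean shutdownExecutors;  // proposed flag, default false
  private final Driver driver;

  KillRouter(boolean shutdownExecutors, Driver driver) {
    this.shutdownExecutors = shutdownExecutors;
    this.driver = driver;
  }

  void kill(String taskId) {
    if (shutdownExecutors && driver instanceof VersionedDriver) {
      // With executor ID == task ID, SHUTDOWN tears down the whole executor,
      // avoiding the zombie executors a plain KILL can leave behind.
      ((VersionedDriver) driver).shutdownExecutor(taskId, agentIdFor(taskId));
    } else {
      driver.killTask(taskId);
    }
  }

  private String agentIdFor(String taskId) {
    // Assumed task-store lookup; the shutdown call still needs an agent ID.
    throw new UnsupportedOperationException("left to the real implementation");
  }
}
```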

On Tue, Jan 16, 2018 at 11:15 AM, Mohit Jaggi  wrote:

> To do this in a backward compatible manner, one way is :
>
> ```
> void destroy(taskId, executorId, agentId) {
>
> if (driver instanceof Versioned)
>(Versioned...)driver.shutdown(executorId, agentId)
> else
>driver.kill(taskId)
>
> }
> ```
>
> Any other opinions?
>
> On Tue, Jan 16, 2018 at 11:12 AM, David McLaughlin wrote:
>
>> Nope, I support getting SHUTDOWN in for users of the new API.
>>
>> On Tue, Jan 16, 2018 at 11:06 AM, Mohit Jaggi 
>> wrote:
>>
>>> Are you suggesting that we delay the switch to SHUTDOWN call until this
>>> working group can resolve the API perf issue?
>>>
>>> On Mon, Jan 15, 2018 at 3:55 PM, David McLaughlin <
>>> dmclaugh...@apache.org> wrote:
>>>
 We are working with Mesos folks to resolve it. There is a Mesos
 performance working group that folks can join if they'd like to contribute:
 http://mesos.apache.org/blog/performance-working-group-progress-report/

 I'm not sure what you mean by branch. Everything we used to scale test
 is on master.

 On Mon, Jan 15, 2018 at 10:08 AM, Meghdoot bhattacharya <
 meghdoo...@yahoo.com> wrote:

> David, should twitter try against mesos 1.5 to see if things are
> better with the new api instead of libmesos. This is going to be a drift
> over time that will stop us from adopting new features.
>
> If it was sometime back it would be good to rerun the tests and open a
> ticket in Mesos if issues exist. All aurora users can then push for
> resolution.
>
> Also details on branch etc that has the api integration?
>
> Thx
>
> On Jan 12, 2018, at 11:39 AM, David McLaughlin 
> wrote:
>
> I'm not sure I agree with the summary. Bill's proposal was using
> shutdown only when using the new API. I would also support this if it's
> possible.
>
> On Fri, Jan 12, 2018 at 11:14 AM, Mohit Jaggi 
> wrote:
>
>> Summary so far:
>> - Bill supports making this change
>> - This change cannot be made in a backward compatible manner
>> - David (Twitter) does not want to use HTTP APIs due to performance
>> concerns. I conclude that folks from Twitter don't support this change
>>
>> Question:
>> - Are there other users that want this change?
>>
>>
>>
>

>>>
>>
>


Re: explain these replication logs?

2017-12-13 Thread Bill Farner
I'm unfamiliar.  The mesos dev list may be able to give more insight.  I'd
be interested in your findings!

On Tue, Dec 12, 2017 at 4:32 PM, Mohit Jaggi  wrote:

> For the same position I see two bursts of writes, one around 00:12:36 and
> another 12 min earlier. Any idea what this means?
>
> ~/a/a/aurora-outage ❯❯❯ grep 67516183 cpp-repl-logs
> Nov  8 00:12:36 host1161 aurora-scheduler[112446]: I1108 00:12:36.979269
> 112579 replica.cpp:390] Replica received explicit promise request from
> __req_res__(7)@172.0.6.42.2:8083 for position 67516183 with proposal 33898
> Nov  8 00:12:36 host1161 aurora-scheduler[112446]: I1108 00:12:36.982532
> 112579 replica.cpp:710] Persisted action NOP at position 67516183
> Nov  8 00:12:36 host1161 aurora-scheduler[112446]: I1108 00:12:36.990187
> 112580 replica.cpp:693] Replica received learned notice for position
> 67516183 from @0.0.0.0:0
> Nov  8 00:12:36 host1161 aurora-scheduler[112446]: I1108 00:12:36.994510
> 112580 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:12:36 host2091 aurora-scheduler[80146]: I1108 00:12:36.978763
> 80281 replica.cpp:390] Replica received explicit promise request from
> __req_res__(6)@172.0.6.42.2:8083 for position 67516183 with proposal 33898
> Nov  8 00:12:36 host2091 aurora-scheduler[80146]: I1108 00:12:36.989364
> 80281 replica.cpp:710] Persisted action NOP at position 67516183
> Nov  8 00:12:36 host2091 aurora-scheduler[80146]: I1108 00:12:36.989794
> 80278 replica.cpp:693] Replica received learned notice for position
> 67516183 from @0.0.0.0:0
> Nov  8 00:12:37 host2091 aurora-scheduler[80146]: I1108 00:12:37.005336
> 80278 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:00:32 host1162 aurora-scheduler[14638]: I1108 00:00:32.736395
> 14772 coordinator.cpp:348] Coordinator attempting to write APPEND action at
> position 67516183
> Nov  8 00:00:32 host1162 aurora-scheduler[14638]: I1108 00:00:32.736794
> 14756 replica.cpp:539] Replica received write request for position 67516183
> from __req_res__(4)@172.0.8.42.11:8083
> Nov  8 00:00:32 host1162 aurora-scheduler[14638]: I1108 00:00:32.740519
> 14756 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:00:32 host1162 aurora-scheduler[14638]: I1108 00:00:32.749094
> 14764 replica.cpp:693] Replica received learned notice for position
> 67516183 from @0.0.0.0:0
> Nov  8 00:00:32 host1162 aurora-scheduler[14638]: I1108 00:00:32.749300
> 14764 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:12:36 host1162 aurora-scheduler[46132]: I1108 00:12:36.992617
> 46463 replica.cpp:390] Replica received explicit promise request from
> __req_res__(8)@172.0.6.42.2:8083 for position 67516183 with proposal 33898
> Nov  8 00:12:36 host1162 aurora-scheduler[46132]: I1108 00:12:36.993018
> 46463 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:12:36 host1162 aurora-scheduler[46132]: I1108 00:12:36.993108
> 46463 replica.cpp:693] Replica received learned notice for position
> 67516183 from @0.0.0.0:0
> Nov  8 00:12:36 host1162 aurora-scheduler[46132]: I1108 00:12:36.993345
> 46463 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:12:37 host1159 aurora-scheduler[37324]: I1108 00:12:36.978830
> 37443 replica.cpp:390] Replica received explicit promise request from
> __req_res__(10)@172.0.6.42.2:8083 for position 67516183 with proposal
> 33898
> Nov  8 00:12:37 host1159 aurora-scheduler[37324]: I1108 00:12:36.988788
> 37443 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:12:37 host1159 aurora-scheduler[37324]: I1108 00:12:36.989609
> 37444 replica.cpp:693] Replica received learned notice for position
> 67516183 from @0.0.0.0:0
> Nov  8 00:12:37 host1159 aurora-scheduler[37324]: I1108 00:12:36.989812
> 37444 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:00:32 host1159 aurora-scheduler[37324]: I1108 00:00:32.737551
> 37453 replica.cpp:539] Replica received write request for position 67516183
> from __req_res__(6)@172.0.8.42.11:8083
> Nov  8 00:00:32 host1159 aurora-scheduler[37324]: I1108 00:00:32.748847
> 37453 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:00:32 host1159 aurora-scheduler[37324]: I1108 00:00:32.749682
> 37447 replica.cpp:693] Replica received learned notice for position
> 67516183 from @0.0.0.0:0
> Nov  8 00:00:32 host1159 aurora-scheduler[37324]: I1108 00:00:32.764811
> 37447 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:12:36 host1163 aurora-scheduler[91197]: I1108 00:12:36.979215
> 91339 replica.cpp:390] Replica received explicit promise request from
> __req_res__(9)@172.0.6.42.2:8083 for position 67516183 with proposal 33898
> Nov  8 00:12:36 host1163 aurora-scheduler[91197]: I1108 00:12:36.996575
> 91339 replica.cpp:710] Persisted action APPEND at position 67516183
> Nov  8 00:12:36 host1163 aurora-scheduler[91197]: 

Re: shutdown vs kill API in Mesos

2017-12-09 Thread Bill Farner
>
> The new API is present in Aurora in a compatibility layer


Aha!  I had not explored that code
<https://github.com/apache/aurora/blob/47c689956f77ed635d26f7ec659689002bd047af/src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java#L180-L185>
yet.  It does seem that SHUTDOWN provides the behavior that we aim for when
killing tasks.  The global executor shutdown timeout (
--executor_shutdown_grace_period) potentially interferes with our
graceful_shutdown_wait_secs job-level configuration.  However, an operator
could use the former as an upper limit to the latter.

From what i see, i'd support a patch to switch to SHUTDOWN when using
DriverKind.V0_DRIVER or DriverKind.V1_DRIVER.

On Sat, Dec 9, 2017 at 4:27 PM, David McLaughlin <dmclaugh...@apache.org>
wrote:

> The new API is present in Aurora in a compatibility layer, but the HTTP
> performance issues still exist so we can't make it the default.
>
> On Sat, Dec 9, 2017 at 4:24 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> Aurora pre-dates SHUTDOWN by several years, so the option was not
>> present.  Additionally, the SHUTDOWN call is not available in the API used
>> by Aurora.  Last i knew, Aurora could not use the "new" API because of
>> performance issues in the implementation, but i do not know where that
>> stands today.
>>
>> https://mesos.apache.org/documentation/latest/scheduler-http-api/#shutdown
>>
>>> NOTE: This is a new call that was not present in the old API
>>
>>
>> On Sat, Dec 9, 2017 at 4:11 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Folks,
>>> Our Mesos team is wondering why Aurora chose KILL over SHUTDOWN for
>>> killing tasks. As Aurora has an executor per task, won't SHUTDOWN work
>>> better? It will avoid zombie executors.
>>>
>>> Mohit.
>>>
>>
>>
>


Re: [ANNOUNCE] 0.19.0 release

2017-12-07 Thread Bill Farner
Thanks for the reminder!  I will try to build and start a vote for these
tomorrow.

On Wed, Dec 6, 2017 at 3:40 PM, Renan DelValle <renanidelva...@gmail.com>
wrote:

> Hi Bill,
>
> It's been almost a month since the release. Any idea when the official deb
> and rpm packages will be released?
>
> -Renan
>
> On Sat, Nov 11, 2017 at 8:50 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> Hello folks,
>>
>> Aurora 0.19.0 has been released!  Please see the blog post for more
>> details: https://aurora.apache.org/blog/aurora-0-19-0-released/
>>
>>
>> Cheers,
>>
>> Bill
>>
>
>


Re: sliding stats testing

2017-12-02 Thread Bill Farner
The underlying Rate stats used here are only updated when sampled, so the
value you have sent to accumulate() is not reflected in rates and ratios
until doSample() is called on them.  For the purposes of this test, it may
be easiest to integrate with TimeSeriesRepositoryImpl and manually induce
sampling.
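
A minimal sketch of that shape of test follows.  The Sampler interface is a
hypothetical stand-in, since the exact wiring to TimeSeriesRepositoryImpl
depends on the test harness, and the SlidingStats import assumes the
vendored commons package:

```
import org.apache.aurora.common.stats.SlidingStats;

class WriteLockWaitTest {
  // Hypothetical stand-in for whatever induces doSample() on the
  // registered stats, e.g. via TimeSeriesRepositoryImpl.
  interface Sampler { void sampleAll(); }

  void simulateHighLockWait(Sampler sampler) {
    SlidingStats writerWaitStats =
        new SlidingStats("log_storage_write_lock_wait", "ns");

    // Accumulate a large wait before sampling...
    writerWaitStats.accumulate(5_000_000_000L);  // 5 seconds, in nanoseconds

    // ...then induce a sample; without this step the derived
    // log_storage_write_lock_wait_ns_per_event rate stays at zero.
    sampler.sampleAll();
  }
}
```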

On Sat, Dec 2, 2017 at 12:17 PM, Mohit Jaggi  wrote:

> Folks,
> I am trying to write a test case and could not find one to refer to. I
> want to set writerWaitStats below to a large value to simulate high values
> for log_storage_write_lock_wait_ns_per_event
> I tried calling accumulate once with a large value or several times with
> large values but it is always zero for log_storage_write_lock_
> wait_ns_per_event
> What am I missing?
>
> Mohit.
>
>
> private SlidingStats writerWaitStats = new 
> SlidingStats("log_storage_write_lock_wait", "ns");
>
> writerWaitStats.accumulate(10L);
>
>


Re: Aurora pauses adding offers

2017-11-29 Thread Bill Farner
>
> Do you mean there is a shared ZK connection for leadership and log
> replication


Nope, different connections.  The log is all managed through code linked
from libmesos, including its use of ZK.  Here's an example of some logs
from this code:

> I1129 15:56:00.560479  9316 group.cpp:341] Group process
> (zookeeper-group(1)@192.168.33.7:8083) connected to ZooKeeper
> I1129 15:56:00.560817  9316 group.cpp:831] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I1129 15:56:00.561249  9316 group.cpp:419] Trying to create path
> '/aurora/replicated-log' in ZooKeeper


Notice the group.cpp.  You'll also see relevant logs coming from log.cpp
and replica.cpp.


On Wed, Nov 29, 2017 at 3:25 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Do you mean there is a shared ZK connection for leadership and log
> replication? I don't see "Lost leadership, committing suicide" during
> outage. I do see it at other times.
>
> On Wed, Nov 29, 2017 at 1:52 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> - Does log replication "maintain" ZK connections and suffer when a NIC
>>> flaps?
>>
>>
>> Maintain, yes.  Shouldn't be impacted unless the ZK session expires,
>> which would trigger a scheduler failover.
>>
>> - If only 1 of 5 ZK's have this issue, could there still be a problem?
>>
>>
>> Assuming this means 5 quorum member - no, that should not be a problem.
>>
>> If any of the above became an issue for the scheduler, it should
>> certainly manifest in logs.
>>
>>
>> On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Thanks Bill. It would be the latter as this back-pressure is only needed
>>> for calls that change state. Read only calls should be quite quick to
>>> serve.
>>>
>>> One potential correlated outage we may have had is a NIC flap on a
>>> Zookeeper node. The following questions come to mind:
>>> - Does log replication "maintain" ZK connections and suffer when a NIC
>>> flaps?
>>> - If only 1 of 5 ZK's have this issue, could there still be a problem?
>>>
>>> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org>
>>> wrote:
>>>
>>>> is there a place I can inject a pre-processor for the API calls
>>>>
>>>>
>>>> There is no off-the-shelf way to do this.  You could intercept API
>>>> calls as HTTP requests similar to the CORS filter
>>>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
>>>> If you wanted to intercept specific calls and/or introspect arguments, you
>>>> would be better off binding a layer for AuroraAdmin.Iface
>>>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
>>>>
>>>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> I agree with that. I also believe that the scheduler should be
>>>>> resilient in the presence of external faults. Systems that export an API
>>>>> must take defensive steps to protect themselves.
>>>>>
>>>>> If I wanted to experiment with this change without modifying Aurora
>>>>> code "inline", is there a place I can inject a pre-processor for the API
>>>>> calls?
>>>>>
>>>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I'd also suggest focusing on the source of the congestion.  Aurora
>>>>>> should offer quite high scheduling throughput, and i would rather focus
>>>>>> energy on addressing bottlenecks.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think more explicit signaling is better. Increased latency can be
>>>>>>> due to other conditions like network issues etc. Right now our 
>>>>>>> mitigation
>>>>>>> involves load-shedding and we would rather have load-avoidance. Indeed
>>>>>>> proxy is not a good option. Only Aurora "knows" when it wants to
>>>>>>> back-pressure.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:5

Re: Aurora taking really long to reschedule a full cluster

2017-11-29 Thread Bill Farner
That works out to scheduling about 1 task/sec, which is at least one order
of magnitude lower than i would expect.  Are you sure tasks were scheduling
and continuing to run, rather than exiting/failing and triggering more
scheduling work?

What build is this from?  Can you share (scrubbed) scheduler logs from this
period?

On Wed, Nov 29, 2017 at 11:54 AM, Mauricio Garavaglia <
mauriciogaravag...@gmail.com> wrote:

> Hello!
>
> Recently, running some reliability tests, we restarted all the nodes in a
> cluster of ~300 hosts and 3k tasks. Aurora took about 1 hour to reschedule
> everything; we had a change of leader in the middle of the scheduling and
> that slowed it down even more. So we started looking at which aurora
> parameters needed more tuning.
>
> The value of max_tasks_per_schedule_attempt is set to the default now,
> that probably needs to be increased, is there a rule of thumb to tune it
> based on cluster size, # of jobs, # of frameworks, etc?
>
> Regarding the JVM, we are running it with Xmx=24G; so far we haven't seen
> pressure there.
>
> Any input on where to look at would be really appreciated :)
>
> Mauricio
>
>
>
>
>


Re: Aurora pauses adding offers

2017-11-29 Thread Bill Farner
>
> is there a place I can inject a pre-processor for the API calls


There is no off-the-shelf way to do this.  You could intercept API calls as
HTTP requests similar to the CORS filter
<https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
If you wanted to intercept specific calls and/or introspect arguments, you
would be better off binding a layer for AuroraAdmin.Iface
<https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
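
If you take the HTTP route, a filter along these lines could shed load
before a request reaches the API handler.  This is only a sketch: the load
signal is a plain supplier you would wire to a stat such as
log_storage_write_lock_wait_ns_per_event, and registration of the filter
is left out.

```
import java.io.IOException;
import java.util.function.LongSupplier;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class BackpressureFilter implements Filter {
  private final LongSupplier lockWaitNsPerEvent;  // assumed stat supplier
  private final long thresholdNs;

  public BackpressureFilter(LongSupplier lockWaitNsPerEvent, long thresholdNs) {
    this.lockWaitNsPerEvent = lockWaitNsPerEvent;
    this.thresholdNs = thresholdNs;
  }

  @Override public void init(FilterConfig config) { }
  @Override public void destroy() { }

  @Override
  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    if (lockWaitNsPerEvent.getAsLong() > thresholdNs) {
      // Tell upstream callers to back off rather than queueing more writes.
      ((HttpServletResponse) resp).sendError(
          HttpServletResponse.SC_SERVICE_UNAVAILABLE, "scheduler under heavy load");
      return;
    }
    chain.doFilter(req, resp);
  }
}
```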

On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> I agree with that. I also believe that the scheduler should be resilient
> in the presence of external faults. Systems that export an API must take
> defensive steps to protect themselves.
>
> If I wanted to experiment with this change without modifying Aurora code
> "inline", is there a place I can inject a pre-processor for the API calls?
>
> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> I'd also suggest focusing on the source of the congestion.  Aurora should
>> offer quite high scheduling throughput, and i would rather focus energy on
>> addressing bottlenecks.
>>
>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> I think more explicit signaling is better. Increased latency can be due
>>> to other conditions like network issues etc. Right now our mitigation
>>> involves load-shedding and we would rather have load-avoidance. Indeed
>>> proxy is not a good option. Only Aurora "knows" when it wants to
>>> back-pressure.
>>>
>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>> dmclaugh...@apache.org> wrote:
>>>
>>>> Any log write latency will be reflected in the overall latency of the
>>>> request. Increased request latency is one of the main ways any server has
>>>> of telling a client that it's under load. It's then up to the client to
>>>> react to this.
>>>>
>>>> If you want to throw error codes, you can put a proxy in front of
>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>> the issue with that is the requests are mostly non-idempotent so you'll
>>>> need to build reconciliation logic into it.
>>>>
>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>>> services. That layer often "amplifies" human effort and may result in a 
>>>>> lot
>>>>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>>>>> transient problems, it can signal that to upstream software in the same 
>>>>> way
>>>>> that busy web servers do during cyber Monday sales :-)
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>> and that it should back off
>>>>>>
>>>>>>
>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>>>> this behavior, so i would like to understand better.
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>> cause (although ZK node is known to have issues now and then but we don't
>>>>>>> have logs for the specific outages we had). We will plan to move to 
>>>>>>> 0.19.x
>>>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>>>> 5xx error codes back when update/rollback/kill etc are called and 
>>>>>>> certain
>>>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, 
>>>>>>> this
>>>>>>> "defense" already exists?
>>>>

Re: Aurora pauses adding offers

2017-11-27 Thread Bill Farner
I'd also suggest focusing on the source of the congestion.  Aurora should
offer quite high scheduling throughput, and i would rather focus energy on
addressing bottlenecks.

On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> I think more explicit signaling is better. Increased latency can be due to
> other conditions like network issues etc. Right now our mitigation involves
> load-shedding and we would rather have load-avoidance. Indeed proxy is not
> a good option. Only Aurora "knows" when it wants to back-pressure.
>
> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org
> > wrote:
>
>> Any log write latency will be reflected in the overall latency of the
>> request. Increased request latency is one of the main ways any server has
>> of telling a client that it's under load. It's then up to the client to
>> react to this.
>>
>> If you want to throw error codes, you can put a proxy in front of Aurora
>> that has request timeouts - which would send 503s to clients. But the issue
>> with that is the requests are mostly non-idempotent so you'll need to build
>> reconciliation logic into it.
>>
>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>> services. That layer often "amplifies" human effort and may result in a lot
>>> of load on Aurora. Usually that is fine but if Aurora slowed down due to
>>> transient problems, it can signal that to upstream software in the same way
>>> that busy web servers do during cyber Monday sales :-)
>>>
>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org>
>>> wrote:
>>>
>>>> I want to let upstream software "know" that Aurora is slowing down and
>>>>> that it should back off
>>>>
>>>>
>>>> Can you offer more detail about how Aurora is being used in this
>>>> regard?  I haven't seen use cases in the past that would be amenable to
>>>> this behavior, so i would like to understand better.
>>>>
>>>>
>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>> cause (although ZK node is known to have issues now and then but we don't
>>>>> have logs for the specific outages we had). We will plan to move to 0.19.x
>>>>> soon. In addition I want to let upstream software "know" that Aurora is
>>>>> slowing down and that it should back off. To achieve this I want to send
>>>>> 5xx error codes back when update/rollback/kill etc are called and certain
>>>>> metrics (like log write lock wait time) indicate heavy load. Perhaps, this
>>>>> "defense" already exists?
>>>>>
>>>>>
>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> The next level is to determine why the storage lock is being held.
>>>>>> Common causes include:
>>>>>>
>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>> O(gb)
>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>> 2. scheduler replicated log on slow disks
>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>> between zookeepers
>>>>>>
>>>>>> As an immediate (partial) remedy, i suggest you upgrade to eliminate
>>>>>> the use of SQL/mybatis in the scheduler.  This helped twitter improve (1)
>>>>>> and (1a).
>>>>>>
>>>>>> commit f2755e1
>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>
>>>>>> Exclusively use Map-based in-memory stores for primary storage
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>

Re: HTTP API examples

2017-11-27 Thread Bill Farner
It is true that thrift is the only supported API.  You are welcome to try
/apibeta, just be aware that issues you encounter may not be fixed.  That
said, it has been in place for ~3 years and would probably not be removed
unless it impedes other work, or a superior replacement is introduced.

On Mon, Nov 27, 2017 at 11:39 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> I see. There is no JSON interface then, clients have to use thrift?
>
> On Mon, Nov 27, 2017 at 10:50 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> I suspect you are looking at /apibeta, which serves a javadoc-style doc
>> from a GET request.  There is no support for this interface, however, and
>> it is subject to removal in the future.  That being said, you can see an
>> example of querying by task status in this test case
>> <https://github.com/apache/aurora/blob/0f3dc939e2af1fb5751109e0ff7b6a0f7df70ac0/src/test/java/org/apache/aurora/scheduler/http/api/ApiBetaTest.java#L153-L163>
>> .
>>
>> On Mon, Nov 27, 2017 at 10:31 AM, Renan DelValle <
>> renanidelva...@gmail.com> wrote:
>>
>>> Hi Mohit,
>>>
>>> I think it would be useful if you could include a link to the
>>> documentation on the UI you're talking about. Looking at the source code,
>>> the UI uses Thrift via javascript to query the scheduler for the info it
>>> needs for each page. As far as I know, there is no way to query the
>>> scheduler for task status without using Thrift at this moment.
>>>
>>> -Renan
>>>
>>> On Sun, Nov 26, 2017 at 2:19 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> Hi,
>>>> I am looking for some examples of HTTP API calls to the scheduler using
>>>> JSON (not Thrift). I would like to use that to query tasks with a given
>>>> status etc. I can see some documentation on the scheduler UI but it is not
>>>> clear without examples how to use the API using JSON.
>>>>
>>>> Mohit.
>>>>
>>>
>>>
>>
>


Re: HTTP API examples

2017-11-27 Thread Bill Farner
I suspect you are looking at /apibeta, which serves a javadoc-style doc
from a GET request.  There is no support for this interface, however, and
it is subject to removal in the future.  That being said, you can see an
example of querying by task status in this test case.
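
For illustration, a raw JSON call against that unsupported endpoint might
look like the sketch below.  The /apibeta/<method> path and the
parameter-name-keyed JSON body are inferred from the test case above, so
treat both as assumptions rather than a stable contract.

```
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ApiBetaQuery {
  public static void main(String[] args) throws Exception {
    // Hypothetical scheduler address; the method name maps to a thrift RPC.
    URL url = new URL("http://scheduler.example.com:8081/apibeta/getTasksStatus");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);

    // Arguments are passed as a JSON object keyed by thrift parameter name.
    String body = "{\"query\": {\"statuses\": [\"RUNNING\"]}}";
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
```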

On Mon, Nov 27, 2017 at 10:31 AM, Renan DelValle 
wrote:

> Hi Mohit,
>
> I think it would be useful if you could include a link to the
> documentation on the UI you're talking about. Looking at the source code,
> the UI uses Thrift via javascript to query the scheduler for the info it
> needs for each page. As far as I know, there is no way to query the
> scheduler for task status without using Thrift at this moment.
>
> -Renan
>
> On Sun, Nov 26, 2017 at 2:19 PM, Mohit Jaggi  wrote:
>
>> Hi,
>> I am looking for some examples of HTTP API calls to the scheduler using
>> JSON (not Thrift). I would like to use that to query tasks with a given
>> status etc. I can see some documentation on the scheduler UI but it is not
>> clear without examples how to use the API using JSON.
>>
>> Mohit.
>>
>
>


Re: Apache Aurora holding resources which makes other frameworks starve

2017-11-26 Thread Bill Farner
The file to edit should indeed be /etc/default/aurora-scheduler,
specifically by populating EXTRA_SCHEDULER_ARGS:

EXTRA_SCHEDULER_ARGS="-min_offer_hold_time=30secs"

On Sun, Nov 26, 2017 at 6:57 AM, bigggyan  wrote:

> Thanks Mohit for the information.
> I have installed Apache Aurora as described in the Aurora installation
> page. I am not using puppet. It is mentioned in the "scheduler
> configuration" page that we need to set the -min_offer_hold_time
> variable. I am not very sure in which config file I need to add this
> parameter. I have tried in /etc/default/aurora-scheduler and
> /etc/default/cluster.json to add this additional parameter, but it did not
> take effect.
> It will be a great help if you can point me to some config file where I need
> to make the changes in 0.19 aurora installation on Ubuntu 16.
>
>
>
> On Sat, Nov 25, 2017 at 11:48 PM, Mohit Jaggi 
> wrote:
>
>> Command line params on Aurora and Mesos control this. The "config file" for 
>> this may depend on how your cluster is managed. It can be in puppet 
>> manifest, for example. See below for the parameters. Docs are 
>> http://mesos.apache.org/documentation/latest/configuration/master/ and 
>> http://aurora.apache.org/documentation/latest/reference/scheduler-configuration/
>>
>> On Aurora:
>>
>> -min_offer_hold_time (default (5, mins))
>> Minimum amount of time to hold a resource offer before declining
>>
>> -offer_filter_duration (default (5, secs))
>> Duration after which we expect Mesos to re-offer unused resources. A 
>> short duration improves scheduling performance in smaller clusters, but 
>> might lead to resource starvation for other frameworks if you run many 
>> frameworks in your cluster.
>> -offer_hold_jitter_window (default (1, mins))
>> Maximum amount of random jitter to add to the offer hold time window.
>>
>>
>> On Mesos:
>> --offer_timeout=VALUE Duration of time before an offer is rescinded from
>> a framework. This helps fairness when running frameworks that hold on to
>> offers, or frameworks that accidentally drop offers. If not set, offers do
>> not timeout.
>>
>> On Sat, Nov 25, 2017 at 7:49 PM, bigggyan  wrote:
>>
>>> Hello Everyone,
>>>
>>> I am using Aurora along with other in-house frameworks and could see
>>> Aurora is holding resource offers for 3 mins which puts other frameworks in
>>> starvation. Can anyone please suggest where to make the configuration
>>> changes to reduce the time? If possible please specify the config file
>>> location where I can make changes to change the parameter.
>>>
>>> Thanks
>>> Biggyan
>>>
>>
>>
>


Re: reverting logback dependency update

2017-11-20 Thread Bill Farner
Aha.  Yes, i suspect you will be fine to revert these locally.

On Mon, Nov 20, 2017 at 7:11 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> I should have been clear. I meant if I change it in my fork, should I
> expect it to work? Or is there a change later in 0.18.1 that relies on the
> version being new?
>
> Sent from my iPhone
>
> On Nov 20, 2017, at 6:42 PM, Bill Farner <wfar...@apache.org> wrote:
>
> I don't think it is fair to the community or practical to hold back
> library versions because of conflicts in proprietary custom builds of
> Aurora.  So in general, i am -1 on the precedent this would set.
>
> On Mon, Nov 20, 2017 at 5:53 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Folks,
>> Due to a conflict with another tool we use, I can't use logback 1.2.3 and
>> slf4j 1.7.25 yet. Is it safe to change them to the previous values?
>>
>> Ref: https://github.com/apache/aurora/commit/d7425aa56d3fba98f4a16cb93bff8f9ce7ce0e67
>>
>> Mohit.
>>
>
>


Re: reverting logback dependency update

2017-11-20 Thread Bill Farner
I don't think it is fair to the community or practical to hold back library
versions because of conflicts in proprietary custom builds of Aurora.  So
in general, i am -1 on the precedent this would set.

On Mon, Nov 20, 2017 at 5:53 PM, Mohit Jaggi  wrote:

> Folks,
> Due to a conflict with another tool we use, I can't use logback 1.2.3 and
> slf4j 1.7.25 yet. Is it safe to change them to the previous values?
>
> Ref: https://github.com/apache/aurora/commit/d7425aa56d3fba98f4a16cb93bff8f9ce7ce0e67
>
> Mohit.
>


Re: Aurora pauses adding offers

2017-11-13 Thread Bill Farner
The next level is to determine why the storage lock is being held.  Common
causes include:

1. storage snapshot slowness, when scheduler state is very large, O(gb)
1a. long GC pauses in the scheduler, often induced by (1)
2. scheduler replicated log on slow disks
3. network issues between schedulers, schedulers to zookeeper, or between
zookeepers

As an immediate (partial) remedy, i suggest you upgrade to eliminate the
use of SQL/mybatis in the scheduler.  This helped twitter improve (1) and
(1a).

commit f2755e1
Author: Bill Farner <wfar...@apache.org>
Date:   Tue Oct 24 23:34:09 2017 -0700

Exclusively use Map-based in-memory stores for primary storage


On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> and in log_storage_write_lock_wait_ns_per_event
>
> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>> cause or effect? :-)
>>
>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Thanks Bill. Please see inline:
>>>
>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> I suspect they are getting enqueued
>>>>
>>>>
>>>> Just to be sure - the offers do eventually get through though?
>>>>
>>>>
>>> In one instance the offers did get through but it took several minutes.
>>> In other instances we reloaded the scheduler to let another one become the
>>> leader.
>>>
>>>
>>>> The most likely culprit is contention for the storage write lock,  
>>>> observable
>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>
>>>
>>> Thanks. I will check that one.
>>>
>>>
>>>>
>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>>>>> calls are being made at that time
>>>>
>>>>
>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>> processing directly, but could potentially slow down the scheduler as a
>>>> whole.
>>>>
>>>>
>>> So these won't contend for locks with offer processing and task
>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>> machine. I also noticed a spike in mybatis active and bad connections.
>>> Can't say if the spike in active is due to many bad connections or vice
>>> versa or there was a 3rd source causing both of these. Are there any
>>> metrics or logs that might help here?
>>>
>>>
>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>> ASSIGNED due to lack of offers?
>>>>
>>>>
>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>> tasks in states where the scheduler is waiting for something else to act
>>>> and it does not hear back (via a status update).
>>>>
>>>
>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos
>>> are being delayed or processed too slowly both these effects will occur?
>>>
>>>
>>>>
>>>> I suggest digging into the cause of delayed offer processing first, i
>>>> suspect it might be related to the task timeouts as well.
>>>>
>>>> version close to 0.18.0
>>>>
>>>>
>>>> Is the ambiguity due to custom patches?  Can you at least indicate
>>>> the last git SHA off aurora/master?  Digging much deeper in diagnosing this
>>>> may prove tricky without knowing what code is in play.
>>>>
>>>>
>>> c85bffd10 is the commit from which we forked.
>>>
>>> The custom patch is mainly the dynamic reservation work done by
>>> Dmitri. We also have commits for offer/rescind race issue, setrootfs patch
>>> (which is not upstreamed yet).
>>>
>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>
>>>
>>>
>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>> time. Can this happen if task is in PENDING state and does not reach
>>>>> ASSIGNED due to lack of offers?
>>>>>
>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Folks,
>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>> 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I 
>>>>>> tail
>>>>>> the logs I can see offers are coming in. I suspect they are getting
>>>>>> enqueued for processing by "executor" but stay there for a long time and
>>>>>> are not processed either due to locking or thread starvation.
>>>>>>
>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could these
>>>>>> calls starve the OfferManager(e.g. by contending for some lock)? What
>>>>>> should I be looking for to debug this condition further?
>>>>>>
>>>>>> Mohit.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


[ANNOUNCE] 0.19.0 release

2017-11-11 Thread Bill Farner
Hello folks,

Aurora 0.19.0 has been released!  Please see the blog post for more
details: https://aurora.apache.org/blog/aurora-0-19-0-released/


Cheers,

Bill


Re: Aurora pauses adding offers

2017-11-10 Thread Bill Farner
>
> I suspect they are getting enqueued


Just to be sure - the offers do eventually get through though?

The most likely culprit is contention for the storage write lock,  observable
via spikes in stat log_storage_write_lock_wait_ns_total.

> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
> calls are being made at that time


This sounds like API activity.  This shouldn't interfere with offer
processing directly, but could potentially slow down the scheduler as a
whole.

> I also notice a lot of "Timeout reached for task..." around the same time.
> Can this happen if task is in PENDING state and does not reach ASSIGNED due
> to lack of offers?


This is unusual.  Pending tasks are not timed out; this applies to tasks in
states where the scheduler is waiting for something else to act and it does
not hear back (via a status update).

I suggest digging into the cause of delayed offer processing first, i
suspect it might be related to the task timeouts as well.

> version close to 0.18.0


Is the ambiguity due to custom patches?  Can you at least indicate the
last git SHA off aurora/master?  Digging much deeper in diagnosing this may
prove tricky without knowing what code is in play.


On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi  wrote:

> I also notice a lot of "Timeout reached for task..." around the same time.
> Can this happen if task is in PENDING state and does not reach ASSIGNED due
> to lack of offers?
>
> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi  wrote:
>
>> Folks,
>> I have noticed some weird behavior in Aurora (version close to 0.18.0).
>> Sometimes, it shows no offers in the UI offers page. But if I tail the logs
>> I can see offers are coming in. I suspect they are getting enqueued for
>> processing by "executor" but stay there for a long time and are not
>> processed either due to locking or thread starvation.
>>
>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs()
>> calls are being made at that time. Could these calls starve the
>> OfferManager(e.g. by contending for some lock)? What should I be looking
>> for to debug this condition further?
>>
>> Mohit.
>>
>
>


[CVE-2016-4437] Apache Aurora information disclosure vulnerability (amended)

2017-11-01 Thread Bill Farner
Please see below for the amended notice.  The prior announcement indicated
that releases prior to 0.10.0 were unaffected, which is incorrect.  Versions
0.8.0 - 0.18.0 included vulnerable Shiro versions.

Versions Affected:
Aurora 0.8.0 - 0.18.0

Description:
The affected versions of the scheduler rely on a version of Apache Shiro
which is vulnerable to CVE-2016-4437.  Under certain conditions, the
vulnerability allows remote attackers to execute arbitrary code or bypass
intended access restrictions via an unspecified request parameter.

Mitigation:
0.18.0 and earlier users should upgrade to 0.18.1 or apply this patch
https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=commit;h=ec640117
Alternatively, INI configuration mitigations outlined in CVE-2016-4437
may be applied.

Credit:
This issue was discovered by Greg Harris from the Fitbit Security team.


[CVE-2016-4437] Apache Aurora information disclosure vulnerability

2017-11-01 Thread Bill Farner
Versions Affected:
Aurora 0.10.0 to 0.18.0

Description:
The affected versions of the scheduler rely on a version of Apache Shiro
which is vulnerable to CVE-2016-4437.  Under certain conditions, the
vulnerability allows remote attackers to execute arbitrary code or bypass
intended access restrictions via an unspecified request parameter.

Mitigation:
0.18.0 users should upgrade to 0.18.1
0.10.0 - 0.17.0 users should upgrade to 0.18.1 or apply this patch
https://git-wip-us.apache.org/repos/asf?p=aurora.git;a=commit;h=ec640117
Alternatively, INI configuration mitigations outlined in CVE-2016-4437
may be applied.

Credit:
This issue was discovered by Greg Harris from the Fitbit Security team.


[ANNOUNCE] 0.18.1 release

2017-11-01 Thread Bill Farner
Hello folks,

I'm pleased to announce that Apache Aurora 0.18.1 has been released!

More details can be found in the blog post:
https://aurora.apache.org/blog/aurora-0-18-1-released/


Cheers,

Bill


Re: distinguishing failure types during upgrade

2017-11-01 Thread Bill Farner
>
> How does rollback work in that case


Rollback behavior is unchanged when update pulses are enabled.

> disable auto-rollback


That's also a feasible option.

On Wed, Nov 1, 2017 at 9:15 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Signal =
> - exit status from service
> - reason code from mesos, if task was killed by Mesos e.g. revocable core
> revoked during oversubscription
>
> Yes, I am aware of co-ordinated updates which allow this logic to be
> placed outside Aurora. How does rollback work in that case? Perhaps I
> should just disable auto-rollback in that case and put the rollback logic
> also into this external system.
>
> On Wed, Nov 1, 2017 at 8:39 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> Can Aurora distinguish between failures caused by the upgrade itself or
>>> other transient systemic issues
>>
>>
>> There isn't any signal i know of that would allow Aurora to independently
>> determine the cause of task failures in a generic way.
>>
>> Two options come to mind:
>> 1. Human intervention - aurora update pause from the CLI
>> 2. Configure jobs to use JobUpdateSettings.blockIfNoPulsesAfterMs
>> <https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L708-L714>,
>> and set up an in-house service to invoke pulseJobUpdate()
>> <https://github.com/apache/aurora/blob/d106b4ecc9537b8e844c4edc2210b9fe1853ccc4/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L1134-L1139>.
>> This opts the job update into requiring periodic positive acknowledgement
>> from an external system that it is safe to proceed.  You could use this,
>> for example, to automatically gate an update while a service has alerts
>> firing.
>>
>>
>>
>> On Tue, Oct 31, 2017 at 1:14 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Folks,
>>> Sometimes in our cluster upgrades start failing due to transient outages
>>> of dependencies or reasons unrelated to the new code being pushed out.
>>> Aurora hits its failure threshold and starts automatic rollback which may
>>> make a bad condition worse (e.g. if the outage was related to load, rollback
>>> will increase load). Can Aurora distinguish between failures caused by the
>>> upgrade itself or other transient systemic issues (using e.g. reason code)?
>>> If not does this make sense as a new feature?
>>>
>>> Mohit.
>>>
>>>
>>
>


Re: distinguishing failure types during upgrade

2017-11-01 Thread Bill Farner
>
> Can Aurora distinguish between failures caused by the upgrade itself or
> other transient systemic issues


There isn't any signal i know of that would allow Aurora to independently
determine the cause of task failures in a generic way.

Two options come to mind:
1. Human intervention - aurora update pause from the CLI
2. Configure jobs to use JobUpdateSettings.blockIfNoPulsesAfterMs, and set
up an in-house service to invoke pulseJobUpdate().
This opts the job update into requiring periodic positive acknowledgement
from an external system that it is safe to proceed.  You could use this,
for example, to automatically gate an update while a service has alerts
firing.
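
A bare-bones sketch of such an in-house pulser follows.  The scheduler
address, job coordinates, and the alertsFiring() monitoring hook are
assumptions; pulseJobUpdate() is the real RPC named above.

```
import org.apache.aurora.gen.AuroraAdmin;
import org.apache.aurora.gen.JobKey;
import org.apache.aurora.gen.JobUpdateKey;
import org.apache.thrift.protocol.TJSONProtocol;
import org.apache.thrift.transport.THttpClient;

public class UpdatePulser {
  public static void main(String[] args) throws Exception {
    THttpClient transport =
        new THttpClient("http://scheduler.example.com:8081/api");
    AuroraAdmin.Client client =
        new AuroraAdmin.Client(new TJSONProtocol(transport));

    JobUpdateKey key = new JobUpdateKey()
        .setJob(new JobKey()
            .setRole("www-data").setEnvironment("prod").setName("hello"))
        .setId(args[0]);  // ID of the in-flight update

    // Keep acking while the service is healthy; once alerts fire, stop
    // pulsing and let blockIfNoPulsesAfterMs gate the update.
    while (!alertsFiring()) {
      client.pulseJobUpdate(key);
      Thread.sleep(15_000);  // pulse well inside the configured window
    }
  }

  // Hypothetical hook into a monitoring system.
  static boolean alertsFiring() { return false; }
}
```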



On Tue, Oct 31, 2017 at 1:14 PM, Mohit Jaggi  wrote:

> Folks,
> Sometimes in our cluster upgrades start failing due to transient outages
> of dependencies or reasons unrelated to the new code being pushed out.
> Aurora hits its failure threshold and starts automatic rollback which may
> make a bad condition worse (e.g. if the outage was related to load, rollback
> will increase load). Can Aurora distinguish between failures caused by the
> upgrade itself or other transient systemic issues (using e.g. reason code)?
> If not does this make sense as a new feature?
>
> Mohit.
>
>


Re: updateconfig doc

2017-10-30 Thread Bill Farner
Correct!

On Mon, Oct 30, 2017 at 2:32 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Got it...and wait_for_batch_completion changes this from a "sliding" to a
> "rolling" window?
>
> On Mon, Oct 30, 2017 at 2:28 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> Joshua beat me to the reply, so now you have corroboration for his
>> correction :-)
>>
>> On Mon, Oct 30, 2017 at 2:26 PM, Bill Farner <wfar...@apache.org> wrote:
>>
>>> Clarification - shard and instance are (unfortunately) used
>>> interchangeably in some of our docs, despite the fact that shard can have a
>>> different meaning in other contexts.
>>>
>>> The meaning of batch_size doesn't match either rephrasing you offer;
>>> perhaps the docs need work!  batch_size effectively tells the updater what
>>> portion of your service may be down in the course of the update.  This
>>> becomes the size of a sliding window as the update proceeds across the
>>> instances of the service.
>>>
>>> i.e. if batch_size is 3, the updater will start updating 3 instances
>>> immediately, and proceed through all instances with 3 instances updating at
>>> any time until it reaches the end.
>>>
>>> Does that clarify?
>>>
>>> On Mon, Oct 30, 2017 at 2:01 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> Folks,
>>>> Does the following doc mean A or B?
>>>>
>>>> *A*: batch_size is the number of instances in a given shard
>>>> *B:* batch_size is the number of shards. So every batch has (number of
>>>> instances)/(batch_size) tasks.
>>>>
>>>> Mohit.
>>>> UpdateConfig Objects
>>>>
>>>> Parameters for controlling the rate and policy of rolling updates.
>>>> object      type     description
>>>> batch_size  Integer  Maximum number of shards to be updated in one
>>>> iteration (Default: 1)
>>>>
>>>
>>>
>>
>


Re: updateconfig doc

2017-10-30 Thread Bill Farner
Joshua beat me to the reply, so now you have corroboration for his
correction :-)

On Mon, Oct 30, 2017 at 2:26 PM, Bill Farner <wfar...@apache.org> wrote:

> Clarification - shard and instance are (unfortunately) used
> interchangeably in some of our docs, despite the fact that shard can have a
> different meaning in other contexts.
>
> The meaning of batch_size doesn't match either rephrasing you offer;
> perhaps the docs need work!  batch_size effectively tells the updater what
> portion of your service may be down in the course of the update.  This
> becomes the size of a sliding window as the update proceeds across the
> instances of the service.
>
> i.e. if batch_size is 3, the updater will start updating 3 instances
> immediately, and proceed through all instances with 3 instances updating at
> any time until it reaches the end.
>
> Does that clarify?
>
> On Mon, Oct 30, 2017 at 2:01 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>
>> Folks,
>> Does the following doc mean A or B?
>>
>> *A*: batch_size is the number of instances in a given shard
>> *B:* batch_size is the number of shards. So every batch has (number of
>> instances)/(batch_size) tasks.
>>
>> Mohit.
>> UpdateConfig Objects
>>
>> Parameters for controlling the rate and policy of rolling updates.
>> object      type     description
>> batch_size  Integer  Maximum number of shards to be updated in one
>> iteration (Default: 1)
>>
>
>


Re: updateconfig doc

2017-10-30 Thread Bill Farner
Clarification - shard and instance are (unfortunately) used interchangeably
in some of our docs, despite the fact that shard can have a different
meaning in other contexts.

The meaning of batch_size doesn't match either rephrasing you offer;
perhaps the docs need work!  batch_size effectively tells the updater what
portion of your service may be down in the course of the update.  This
becomes the size of a sliding window as the update proceeds across the
instances of the service.

i.e. if batch_size is 3, the updater will start updating 3 instances
immediately, and proceed through all instances with 3 instances updating at
any time until it reaches the end.

Does that clarify?

On Mon, Oct 30, 2017 at 2:01 PM, Mohit Jaggi  wrote:

> Folks,
> Does the following doc mean A or B?
>
> *A*: batch_size is the number of instances in a given shard
> *B:* batch_size is the number of shards. So every batch has (number of
> instances)/(batch_size) tasks.
>
> Mohit.
> UpdateConfig Objects
>
> Parameters for controlling the rate and policy of rolling updates.
> object      type     description
> batch_size  Integer  Maximum number of shards to be updated in one
> iteration (Default: 1)
>


Re: Lost framework registered event [Was Re: leader election issues]

2017-10-26 Thread Bill Farner
FYI - i believe we stumbled on this issue, which is greatly exacerbated by
f2755e1
<https://github.com/apache/aurora/commit/f2755e1cdd67f3c1516726c21d6e8f13059a5a01>.
The good news is that we now have a good handle on the culprit!  More
details at https://issues.apache.org/jira/browse/AURORA-1953

On Thu, Sep 28, 2017 at 2:14 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Hmm...it is a very busy cluster and 10 mins of logs will be voluminous.
> They contain some internal details which I cannot share publicly. If you
> suspect specific areas, I can try to get those logs and remove internal
> info.
>
> Re: code, we have a fork which is very close to master.
>
> On Wed, Sep 27, 2017 at 10:03 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> What commit/release was this with?  From the looks of the log contents,
>> it's not master.  I'd like to make sure i'm looking at the correct code.
>>
>> Are there more logs not being included?  If so, can you share more
>> complete logs?  In particular, logs during the 10 minute delay would be
>> particularly helpful.
>>
>> On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Updating subject...as it looks like leader election was fine but
>>> registration ack did not make it to the SchedulerLifecycle code. Weird that
>>> an event will get lost like that.
>>>
>>> On Tue, Sep 26, 2017 at 4:21 PM, John Sirois <john.sir...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Sep 26, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> John,
>>>>> I was referring to the following log message...isn't that the right
>>>>> one?
>>>>>
>>>>
>>>> Aha - it is, apologies.
>>>>
>>>>
>>>>> Sep 26 18:11:58 machine62 aurora-scheduler[24743]: I0926 18:11:58.795
>>>>>  [Thread-814, MesosCallbackHandler$MesosCallbackHandlerImpl:180]
>>>>> Registered with ID value: "4ca9aa06-3214-4d2c-a678-0832e2f84d17-"
>>>>>
>>>>> On Tue, Sep 26, 2017 at 3:30 PM, John Sirois <john.sir...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > ... but it succeeded(see logs at the end) ...
>>>>>>
>>>>>> The underlying c++ code in the scheduler driver successfully
>>>>>> connected to the leading master socket:
>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master
>>>>>> at http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>
>>>>>> <http://10.163.25.45:5050/master/api/v1/scheduler>This is not the
>>>>>> same as a framework registration call being successfully executed against
>>>>>> the newly connected master.
>>>>>> You need to be careful about what you derive from the logs just based
>>>>>> on a reading of the words. Generally you'll need to look carefully / grep
>>>>>> sourcecode to be sure you are mentally modelling the code flows 
>>>>>> correctly.
>>>>>> It certainly gets tricky.
>>>>>>
>>>>>> On Tue, Sep 26, 2017 at 4:14 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hmm... so it is indeed the mesos registration, but it succeeded (see
>>>>>>> logs at the end). Looking at the code, I see that
>>>>>>> com.google.common.eventbus is used as the pub-sub mechanism to link
>>>>>>> the registration (first log message) to the registrationAcked flag, and
>>>>>>> this flag is not being set for 10 mins (?); otherwise the registration
>>>>>>> timeout handler would not print the second log message.
>>>>>>>
>>>>>>> delayedActions.onRegistrationTimeout(
>>>>>>>     () -> {
>>>>>>>       if (!registrationAcked.get()) {
>>>>>>>         LOG.error(
>>>>>>>             "Framework has not been registered within the tolerated delay.");
>>>>>>>         stateMachine.transition(State.DEAD);
>>>>>>>       }
>>>>>>>     });
>>>>>>>
>>>>>
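For reference, the pub-sub link mentioned above is the standard Guava EventBus
pattern. A minimal self-contained sketch (not the actual scheduler code) of
how a posted "registered" event is meant to flip that flag:

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;
import java.util.concurrent.atomic.AtomicBoolean;

public class RegistrationAckDemo {
  // Stand-in for the scheduler's registration event type, just for this sketch.
  static final class DriverRegistered {}

  public static void main(String[] args) {
    AtomicBoolean registrationAcked = new AtomicBoolean(false);

    EventBus eventBus = new EventBus();
    eventBus.register(new Object() {
      @Subscribe
      public void onRegistered(DriverRegistered event) {
        registrationAcked.set(true);
      }
    });

    // The callback handler posts this after Mesos acks registration. If the
    // post never happens (or the subscriber was never registered), the
    // timeout handler above sees registrationAcked == false and dies.
    eventBus.post(new DriverRegistered());
    System.out.println("acked: " + registrationAcked.get());
  }
}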

Re: orphaned thermos

2017-10-26 Thread Bill Farner
If the executor runs out of memory, i think it should be assumed that it
will no longer be well-behaved.  It seems most sensible for the mesos agent
to clean up in this case.

On Thu, Oct 26, 2017 at 11:56 AM, Mohit Jaggi  wrote:

> We found several zombie executors on a cluster. Thermos logs indicate
> reaching system limits while trying to shutdown(?). Mesos agent is unable
> to get status of this container from docker daemon (docker inspect fails).
> Shouldn't thermos exit in such a case?
>
>
>  22 WARNING: Your kernel does not support swap limit capabilities, memory 
> limited without swap.
>  23 twitter.common.app debug: Initializing: twitter.common.log (Logging 
> subsystem.)
>  24 Writing log files to disk in /mnt/mesos/sandbox
>  25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>  26 I1023 19:04:32.26487042 exec.cpp:237] Executor registered on agent 
> b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>  27 Writing log files to disk in /mnt/mesos/sandbox
>  28 Traceback (most recent call last):
>  29   File 
> "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
>  30 self.__real_run(*args, **kw)
>  31   File "apache/thermos/monitoring/resource.py", line 243, in run
>  32   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py",
>  line 79, in wait
>  33 thread.start()
>  34   File "/usr/lib/python2.7/threading.py", line 745, in start
>  35 _start_new_thread(self.__bootstrap, ())
>  36 thread.error: can't start new thread
>  37 ERROR] *Failed to stop health checkers:*
>  38 ERROR] Traceback (most recent call last):
>  39   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>  40 propagate_deadline(self._chained_checker.stop, 
> timeout=self.STOP_TIMEOUT)
>  41   File "apache/aurora/executor/aurora_executor.py", line 35, in 
> propagate_deadline
>  42 return deadline(*args, daemon=True, propagate=True, **kw)
>  43   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
>  line 61, in deadline
>  44 AnonymousThread().start()
>  45   File "/usr/lib/python2.7/threading.py", line 745, in start
>  46 _start_new_thread(self.__bootstrap, ())
>  47 *error: can't start new thread*
>
> 48
>
>  49 ERROR]* Failed to stop runner:*
> 50 ERROR] Traceback (most recent call last):
>  51   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>  52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>  53   File "apache/aurora/executor/aurora_executor.py", line 35, in 
> propagate_deadline
>  54 return deadline(*args, daemon=True, propagate=True, **kw)
>  55   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
>  line 61, in deadline
>  56 AnonymousThread().start()
>  57   File "/usr/lib/python2.7/threading.py", line 745, in start
>  58 _start_new_thread(self.__bootstrap, ())
>  59 *error: can't start new thread
> * 60
>  61 Traceback (most recent call last):
>  62   File 
> "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
>  63 self.__real_run(*args, **kw)
>  64   File "apache/aurora/executor/status_manager.py", line 62, in run
>  65   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>  66   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py",
>  line 56, in defer
>  67 deferred.start()
>  68   File "/usr/lib/python2.7/threading.py", line 745, in start
>  69 _start_new_thread(self.__bootstrap, ())
>  70* thread.error: can't start new thread*
>
>


Re: fix for aurora-1945

2017-10-02 Thread Bill Farner
>
> Is there a place where we store "used" offers


Once the scheduler has decided to use an offer, the offer is removed
<https://github.com/apache/aurora/blob/6fd6d50288d6155fda212cecab539c5fb669a9aa/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java#L418>
from OfferManager.  The situation you describe is indeed possible.
However, i don't suspect that there's much to gain from trying to mitigate
an accept/rescind race.  That type of race also wouldn't cause any state
integrity issues in the scheduler, which is what this 'global ban' routine
was intending to plug.

On Mon, Oct 2, 2017 at 4:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> I was wondering if that case can be checked and the banning skipped. Is
> there a place where we store "used" offers? At first glance it looks like
> there isn't...perhaps deep down in job/task state but that will be too
> expensive to check.
>
> On Mon, Oct 2, 2017 at 4:44 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> That's true, but it doesn't appear the comment is trying to lay out all
>> possible scenarios.  Instead, it is attempting to explain the rationale for
>> offerManager.banOffer(offerId) a few lines later.
>>
>> On Mon, Oct 2, 2017 at 4:30 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> Folks,
>>> In the code below, isn't there a 3rd case where the offer was "used" and
>>> hence not found for canceling?
>>>
>>> Mohit.
>>>
>>> public void handleRescind(OfferID offerId) {
>>>   log.info("Offer rescinded: {}", offerId.getValue());
>>>
>>>   // For rescinds, we want to ensure they are processed quickly before we
>>>   // attempt to use an invalid offer. There are a few scenarios we want to
>>>   // be aware of:
>>>   //   1. We receive an offer, add it to OfferManager, and then get a
>>>   //      rescind. In this scenario, we can just remove the offer from the
>>>   //      offers list.
>>>   //   2. We receive an offer, but before we add it to the OfferManager
>>>   //      list we get a rescind. In this scenario, we want to ensure that
>>>   //      we do not use it/accept it when the executor finally processes
>>>   //      the offer. We will temporarily ban it and add a command for the
>>>   //      executor to unban it so future offers can be processed normally.
>>>   boolean offerCancelled = offerManager.cancelOffer(offerId);
>>>
>>>
>>
>


Re: aurora crash in PendingTaskProcessor

2017-09-29 Thread Bill Farner
>
> concurrent map


I'm looking at this chunk here
<https://github.com/apache/aurora/blob/7a803730c95fc7d1f788292d83c3d2eeb81a936d/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java#L193-L202>,
where a concurrent map would not help.

  Optional<HostOffer> sameSlave = hostOffers.get(offer.getOffer().getAgentId());
  if (sameSlave.isPresent()) {
    // If there are existing offers for the slave, decline all of them so the
    // master can compact all of those offers into a single offer and send
    // them back.
    LOG.info("Returning offers for " + offer.getOffer().getAgentId().getValue()
        + " for compaction.");
    decline(offer.getOffer().getId());
    removeAndDecline(sameSlave.get().getOffer().getId());
  } else {
    hostOffers.add(offer);

This exhibits a classic check-then-act race on hostOffers, which could
allow a second offer with the same agent ID.  An obvious fix here would be
to move the "if exists, remove, else add" sequence into a synchronized
method in hostOffers.
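Roughly along these lines (a hypothetical, self-contained sketch with invented
names, not a drop-in patch for OfferManager.java):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (all names invented): fold the check-then-act into one
// synchronized method so a second offer for the same agent cannot slip in
// between the existence check and the remove/add.
final class HostOffers {
  private final Map<String, Offer> offersByAgentId = new HashMap<>();

  /**
   * Atomically either records the offer (returning null) or removes and
   * returns the offer previously held for the same agent, so the caller can
   * decline both and let the master compact them.
   */
  synchronized Offer addOrEvictSameAgent(Offer offer) {
    Offer existing = offersByAgentId.remove(offer.agentId);
    if (existing == null) {
      offersByAgentId.put(offer.agentId, offer);
    }
    return existing;
  }

  // Minimal stand-in for the Mesos offer type, just for this sketch.
  static final class Offer {
    final String agentId;
    final String offerId;

    Offer(String agentId, String offerId) {
      this.agentId = agentId;
      this.offerId = offerId;
    }
  }
}

Since both paths go through the same monitor, the map can never hold two
offers for one agent, which is what tripped the uniqueIndex call in
PendingTaskProcessor.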

Happy to help guide you on a patch!



On Fri, Sep 29, 2017 at 9:57 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Will do. If the fix involves making the map of offers by agent id a
> concurrent map...I can contribute that.
>
> On Fri, Sep 29, 2017 at 9:09 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> This is due to multiple offers for the same agent, rather than duplicate
>> offers.  I don't see a specific bug in the suspect code
>> (OfferManager.java), but it does stand out as subject to races.
>> Specifically, there is a lack of synchronization when checking for an offer
>> exists for a given agent ID and subsequently removing that offer.
>>
>> Can you file a bug?
>>
>> On Thu, Sep 28, 2017 at 1:56 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Folks,
>>>
>>> I saw the following crash in my scheduler. It appears to be due to
>>> duplicates offers. Any insights appreciated!
>>>
>>> Mohit.
>>>
>>> *Code:*
>>>
>>> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
>>>
>>> *Logs:*
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017
>>> 6:09:00 PM com.google.common.util.concurrent.ServiceManager$ServiceListener
>>> failed
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service
>>> PreemptorService [FAILED] has failed in the RUNNING state.
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]:
>>> java.lang.IllegalArgumentException: Multiple entries with same key:
>>> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value:
>>> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
>>>
>>>
>>> ...
>>>
>>> ...
>>>
>>>
>>> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: ,
>>> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal,
>>> attributes=[IAttribute{name=host, values=[compute606-dca1]},
>>> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]},
>>> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE,
>>> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple
>>> values under a key, use Multimaps.index.
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.l
>>> ambda$run$0(PendingTaskProcessor.java:146)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStor
>>> age.java:147)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
>>> org.mybatis.guice.transactional.TransactionalMethodIntercept
>>> or.invoke(TransactionalMethodInterceptor.java:101)
>>>
>>>
>>> Sep 28 18:09:00 machine1163 aurora-scheduler[1426

Re: aurora crash in PendingTaskProcessor

2017-09-29 Thread Bill Farner
This is due to multiple offers for the same agent, rather than duplicate
offers.  I don't see a specific bug in the suspect code
(OfferManager.java), but it does stand out as subject to races.
Specifically, there is a lack of synchronization when checking for an offer
exists for a given agent ID and subsequently removing that offer.

Can you file a bug?

On Thu, Sep 28, 2017 at 1:56 PM, Mohit Jaggi  wrote:

> Folks,
>
> I saw the following crash in my scheduler. It appears to be due to
> duplicates offers. Any insights appreciated!
>
> Mohit.
>
> *Code:*
>
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
>
> *Logs:*
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00
> PM com.google.common.util.concurrent.ServiceManager$ServiceListener failed
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service
> PreemptorService [FAILED] has failed in the RUNNING state.
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: 
> java.lang.IllegalArgumentException:
> Multiple entries with same key: 1ed038e0-a3ef-4476-adfd-
> 70c86241c5f7-S102=HostOffer{offer=id {
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value:
> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
>
>
> ...
>
> ...
>
>
> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: , hostAttributes=
> IHostAttributes{host=compute606-dca1.prod.uber.internal,
> attributes=[IAttribute{name=host, values=[compute606-dca1]},
> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]},
> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE,
> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple
> values under a key, use Multimaps.index.
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$
> run$0(PendingTaskProcessor.java:146)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at org.mybatis.guice.
> transactional.TransactionalMethodInterceptor.invoke(
> TransactionalMethodInterceptor.java:101)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.common.inject.TimedInterceptor.
> invoke(TimedInterceptor.java:83)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.storage.log.LogStorage.read(
> LogStorage.java:562)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(
> CallOrderEnforcingStorage.java:113)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(
> PendingTaskProcessor.java:135)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.common.inject.TimedInterceptor.
> invoke(TimedInterceptor.java:83)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.
> runOneIteration(PreemptorModule.java:161)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.util.concurrent.AbstractScheduledService$
> ServiceDelegate$Task.run(AbstractScheduledService.java:188)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ScheduledThreadPoolExecutor$
> ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(
> ScheduledThreadPoolExecutor.java:294)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at
> java.lang.Thread.run(Thread.java:748)
>
>
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316
> [PreemptorService 

Re: Lost framework registered event [Was Re: leader election issues]

2017-09-27 Thread Bill Farner
>>>>>>> On Tue, Sep 26, 2017 at 2:27 PM, Renan DelValle Rueda <
>>>>>>> renanidelva...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I've had this, or something very similar, happen before. It's an
>>>>>>>> issue with Aurora and ZK. Election is based upon ZK, so if writing 
>>>>>>>> down who
>>>>>>>> the leader is to the ZK server path fails, or if ZK is unable to reach
>>>>>>>> quorum on the write, the election will fail. Sometimes this might 
>>>>>>>> manifest
>>>>>>>> itself in weird ways, such as two aurora schedulers believing they are
>>>>>>>> leaders. If you could tell us a little bit about your ZK set up we 
>>>>>>>> might be
>>>>>>>> able to narrow down the issue. Also, Aurora version and whether you are
>>>>>>>> using Curator or the commons library will help as well.
>>>>>>>>
>>>>>>>> On Tue, Sep 26, 2017 at 2:02 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hmm..it seems machine62 became a leader but could not "register"
>>>>>>>>> as leader. Not sure what that means. My naive assumption is that 
>>>>>>>>> "becoming
>>>>>>>>> leader" and "registering as leader" are "atomic".
>>>>>>>>>
>>>>>>>>> --- grep on SchedulerLifecycle -
>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.158 [LeaderSelector-0,
>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>>> STORAGE_PREPARED -> LEADER_AWAITING_REGISTRATION
>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:33 machine62
>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:33.159 [LeaderSelector-0,
>>>>>>>>> SchedulerLifecycle$4:224] Elected as leading scheduler!
>>>>>>>>> aurora-scheduler.log:Sep 26 18:11:37 machine62
>>>>>>>>> aurora-scheduler[24743]: I0926 18:11:37.204 [LeaderSelector-0,
>>>>>>>>> SchedulerLifecycle$DefaultDelayedActions:163] Giving up on
>>>>>>>>> registration in (10, mins)
>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>> aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0,
>>>>>>>>> SchedulerLifecycle$4:235] Framework has not been registered within the
>>>>>>>>> tolerated delay.
>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.205 [Lifecycle-0,
>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>>> LEADER_AWAITING_REGISTRATION -> DEAD
>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition 
>>>>>>>>> DEAD
>>>>>>>>> -> DEAD
>>>>>>>>> aurora-scheduler.log:Sep 26 18:21:37 machine62
>>>>>>>>> aurora-scheduler[24743]: I0926 18:21:37.215 [Lifecycle-0,
>>>>>>>>> SchedulerLifecycle$6:275] Shutdown already invoked, ignoring extra 
>>>>>>>>> call.
>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:05 machine62
>>>>>>>>> aurora-scheduler[54502]: I0926 18:22:05.681 [main,
>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition 
>>>>>>>>> IDLE
>>>>>>>>> -> PREPARING_STORAGE
>>>>>>>>> aurora-scheduler.log:Sep 26 18:22:06 machine62
>>>>>>>>> aurora-scheduler[54502]: I0926 18:22:06.396 [main,
>>>>>>>>> StateMachine$Builder:389] SchedulerLifecycle state machine transition
>>>>>>>>> PREPARING_STORAGE -> STORAGE_PREPARED
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- connecting to mesos -
>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>> 18:11:37.211750 24871 group.cpp:757] Found non-sequence node 
>>>>>>>>> 'log_replicas'
>>>>>>>>> at '/mesos' in ZooKeeper
>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>> 18:11:37.211817 24871 detector.cpp:152] Detected a new leader: 
>>>>>>>>> (id='1506')
>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>> 18:11:37.211917 24871 group.cpp:699] Trying to get
>>>>>>>>> '/mesos/json.info_001506' in ZooKeeper
>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>> 18:11:37.216063 24871 zookeeper.cpp:262] A new leading master (UPID=
>>>>>>>>> master@10.163.25.45:5050) is detected
>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>> 18:11:37.216162 24871 scheduler.cpp:470] New master detected at
>>>>>>>>> master@10.163.25.45:5050
>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>> 18:11:37.217772 24871 scheduler.cpp:479] Waiting for 12.81503ms before
>>>>>>>>> initiating a re-(connection) attempt with the master
>>>>>>>>> Sep 26 18:11:37 machine62 aurora-scheduler[24743]: I0926
>>>>>>>>> 18:11:37.231549 24868 scheduler.cpp:361] Connected with the master at
>>>>>>>>> http://10.163.25.45:5050/master/api/v1/scheduler
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 26, 2017 at 1:24 PM, Bill Farner <wfar...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No, there is not a legitimate reason.  Did this occur for an
>>>>>>>>>> extended period of time?  Do you have logs from the scheduler 
>>>>>>>>>> indicating
>>>>>>>>>> that it lost ZK leadership and subsequently interacted with mesos?
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 26, 2017 at 1:02 PM, Mohit Jaggi <
>>>>>>>>>> mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Fellows,
>>>>>>>>>>> While examining Aurora log files, I noticed a condition where a
>>>>>>>>>>> scheduler was talking to Mesos but it was not showing up as a 
>>>>>>>>>>> leader in
>>>>>>>>>>> Zookeeper. It ultimately restarted itself and another scheduler 
>>>>>>>>>>> became the
>>>>>>>>>>> leader.
>>>>>>>>>>> Is there a reason a non-leading scheduler will talk to Mesos?
>>>>>>>>>>>
>>>>>>>>>>> Mohit.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Way to kill failed instances during a unsuccessful job update

2017-09-19 Thread Bill Farner
Aurora doesn't currently offer a way to do what you describe.

A job in the scheduler describes a provisioning goal (number of instances),
and we assume the scheduler shouldn't choose to modify that goal over
time.  To that end, the scheduler doesn't consider it a problem to
infinitely restart the failed instances; it is hopeful that the environment
will eventually self-heal.


On Mon, Sep 18, 2017 at 5:13 PM, Kaiwen Xu  wrote:

> Hi,
>
> I am wondering if there is any way for Aurora to kill the failed
> instances when a job update is not successful (e.g. apps on some
> backends
> fail to start up etc.)?
>
> Since, right now, one or two backends failing (out of tens to hundreds
> of backends) is acceptable for us, we turned off the "rollback" feature
> during the job update; we don't want to completely roll back the
> whole fleet due to that. However, it seems like with "rollback" off,
> those failed backends will just be left there, and they will try to
> restart infinitely.
>
> Just curious what would be a recommended approach for this situation?
> Should we try to identify those instances and stop them in our own
> deployment scripts?
>
> Thanks,
> Kaiwen
>


Re: Why doesn't announcer delay until task indicates it's ready?

2017-03-21 Thread Bill Farner
Announcement is done immediately to make the presence of an instance visible, so
other services can determine what to do from there. A use case we considered was
allowing monitoring of a service via HTTP before the service is ready for
traffic. This is useful, for example, if the application has a long burn-in
setup phase.

In your case, the expectation is that the load balancer (or other upstream
service) detects and routes away from unavailable backends, whether it's
because they are not yet ready or otherwise. This could be done using independent
health checks or retries, depending on what is available.

On Mar 21, 2017, 8:28 AM -0700, Richard Klancer , wrote:
> Hi all,
>
> I'm preparing to launch a public-facing Aurora based HTTP service. As
> part of this exercise my team recently attempted to `aurora update`
> the service while it was serving high request volume from an external
> load generator.
>
> We were surprised to find that our ops team was paged due to bursts of
> 502's from our frontend server, which routes external traffic to our
> service using the serverset published by the Aurora announcer. Upon
> investigation, we discovered that the serverset is announced as soon
> as the thermos executor runs, even though the app is not ready to
> serve requests right away. The 502s, of course, were due to the chosen
> server not yet being able to respond to a connection request.
>
> Last night I searched JIRA, the user and dev mailing lists, and the
> thermos code, and I didn't see any conversations about delaying
> announcement until the configured health check passes (thus indicating
> that the server is ready to accept connections)
>
> I'm curious why not? This seems like a fundamental requirement.
>
> A couple notes. First, our frontend server doesn't support explicit
> health checking, yet, though this will be implemented soon. Perhaps it
> is considered the proper task of load balancers and frontend servers
> to validate the health of servers in the serverset before routing
> traffic to them?
>
> Also, to work around this problem, we announced the serverset from the
> app itself. This means we no longer have an 'announce' section in our
> config, and thus no portmap. But http health checking is silently (in
> 0.12, though not 0.17) disabled if there is no thermos port named
> 'health'. We had our "admin" and "health" ports aliased, but with no
> portmap I had to just rename "admin" to "health" everywhere in our job
> definition. It works but it's a little silly. This was previously
> noted in https://issues.apache.org/jira/browse/AURORA-321
>
> Thanks in advance for any comments,
>
> --Richard


Re: Prevent service Job moved from one machine to another periodically

2016-06-25 Thread Bill Farner
FYI the behavior of an update will have a similar outcome - tasks are
subject to move when restarted in the course of an update.

On Saturday, June 25, 2016, Ziliang Chen <zlchen@gmail.com> wrote:

> Found the issue in the code, when doing update the job, i first did a
> kill. Thanks Bill/Erb!
>
> On Sun, Jun 26, 2016 at 1:09 AM, Bill Farner <wfar...@apache.org
> <javascript:_e(%7B%7D,'cvml','wfar...@apache.org');>> wrote:
>
>> Entering the KILLING state suggests that a user issued a kill command for
>> the service.  Does that sound plausible?
>>
>>
>> On Saturday, June 25, 2016, Ziliang Chen <zlchen@gmail.com
>> <javascript:_e(%7B%7D,'cvml','zlchen@gmail.com');>> wrote:
>>
>>> Instructed KILL.
>>>
>>>  4 minutes ago - KILLED : Instructed to kill task.
>>>
>>>- 06/25 22:32:23 LOCAL • PENDING
>>>- 06/25 22:33:06 LOCAL • ASSIGNED
>>>- 06/25 22:33:07 LOCAL • STARTING • Initializing sandbox.
>>>- 06/25 22:33:09 LOCAL • RUNNING
>>>- 06/25 22:42:15 LOCAL • KILLING • Killed by UNSECURE
>>>- 06/25 22:42:18 LOCAL • KILLED • Instructed to kill task.
>>>
>>>
>>> On Sat, Jun 25, 2016 at 9:55 PM, Erb, Stephan <
>>> stephan@blue-yonder.com> wrote:
>>>
>>>> When you go to the scheduler website, you should be able to expand the
>>>> task event history of a terminated instance (by clicking on the + icon).
>>>> What does it say there?
>>>>
>>>>
>>>>
>>>> *From: *Ziliang Chen <zlchen@gmail.com>
>>>> *Reply-To: *"user@aurora.apache.org" <user@aurora.apache.org>
>>>> *Date: *Saturday 25 June 2016 at 15:08
>>>> *To: *"user@aurora.apache.org" <user@aurora.apache.org>
>>>> *Subject: *Re: Prevent service Job moved from one machine to another
>>>> periodically
>>>>
>>>>
>>>>
>>>> Hi Erb,
>>>>
>>>>
>>>>
>>>> As always, appreciate for your quick response!
>>>>
>>>> With your statements, I can understand Aurora's philosophy absolutely.
>>>> But in my case, my service program is up and running there in good state,
>>>> it seems that Aurora scheduler will kill my service program periodically
>>>> and move it to another machine. I expect my service program running there
>>>> forever unless there is a restart/crash etc.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Jun 25, 2016 at 8:27 PM, Erb, Stephan <
>>>> stephan@blue-yonder.com> wrote:
>>>>
>>>> Hi Zi-Liang,
>>>>
>>>>
>>>>
>>>> by default, services in Aurora are not pinned to a particular machine.
>>>> This is based on the philosophy that services should be stateless and thus
>>>> not dependent on a particular host, if possible.
>>>>
>>>>
>>>>
>>>> Whenever an instance/task of your service has terminated, the scheduler
>>>> might pick any other random machine to launch a replacement. There are many
>>>> reasons why this could happen:
>>>>
>>>>
>>>>
>>>> · Your instance has crashed, ran out of memory, or simply
>>>> exited normally.
>>>>
>>>> · If enabled, your health checks may have detected that the
>>>> instance is no longer responding.
>>>>
>>>> · The agent machine it was running on failed or lost
>>>> connectivity with Mesos.
>>>>
>>>> · You have used the aurora_admin client to drain a machine.
>>>>
>>>> · You used a client command such as restart or update.
>>>>
>>>>
>>>>
>>>> If necessary, you could use constraints [1] to force Aurora to always
>>>> schedule a service on the same host. However, this is not really
>>>> recommended as it can easily lead to situations where your service cannot
>>>> be launched at all, due to missing resources of the selected host in
>>>> question.
>>>>
>>>>
>>>>
>>>> [1]
>>>> https://github.com/apache/aurora/blob/master/docs/features/constraints.md
>>>>
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Stephan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From: *Ziliang Chen <zlchen@gmail.com>
>>>> *Reply-To: *"user@aurora.apache.org" <user@aurora.apache.org>
>>>> *Date: *Saturday 25 June 2016 at 13:08
>>>> *To: *"user@aurora.apache.org" <user@aurora.apache.org>
>>>> *Subject: *Prevent service Job moved from one machine to another
>>>> periodically
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I have "service" job scheduled by Aurora. I found periodically, the
>>>> service job will be moved from one machine to another (stop it on previous
>>>> machine and restart it on another one). May i ask if this is an expected
>>>> behavior and if it is, how to make the service job stick to one machine
>>>> unless there is a failure ?
>>>>
>>>>
>>>>
>>>> Thank you very much !
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Regards, Zi-Liang
>>>>
>>>> Mail:zlchen@gmail.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Regards, Zi-Liang
>>>>
>>>> Mail:zlchen@gmail.com
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards, Zi-Liang
>>>
>>> Mail:zlchen@gmail.com
>>>
>>
>
>
> --
> Regards, Zi-Liang
>
> Mail:zlchen@gmail.com
> <javascript:_e(%7B%7D,'cvml','Mail:zlchen@gmail.com');>
>


Re: Prevent service Job moved from one machine to another periodically

2016-06-25 Thread Bill Farner
Entering the KILLING state suggests that a user issued a kill command for
the service.  Does that sound plausible?

On Saturday, June 25, 2016, Ziliang Chen  wrote:

> Instructed KILL.
>
>  4 minutes ago - KILLED : Instructed to kill task.
>
>- 06/25 22:32:23 LOCAL • PENDING
>- 06/25 22:33:06 LOCAL • ASSIGNED
>- 06/25 22:33:07 LOCAL • STARTING • Initializing sandbox.
>- 06/25 22:33:09 LOCAL • RUNNING
>- 06/25 22:42:15 LOCAL • KILLING • Killed by UNSECURE
>- 06/25 22:42:18 LOCAL • KILLED • Instructed to kill task.
>
>
> On Sat, Jun 25, 2016 at 9:55 PM, Erb, Stephan  > wrote:
>
>> When you go to the scheduler website, you should be able to expand the
>> task event history of a terminated instance (by clicking on the + icon).
>> What does it say there?
>>
>>
>>
>> *From: *Ziliang Chen > >
>> *Reply-To: *"user@aurora.apache.org
>> " <
>> user@aurora.apache.org
>> >
>> *Date: *Saturday 25 June 2016 at 15:08
>> *To: *"user@aurora.apache.org
>> " <
>> user@aurora.apache.org
>> >
>> *Subject: *Re: Prevent service Job moved from one machine to another
>> periodically
>>
>>
>>
>> Hi Erb,
>>
>>
>>
>> As always, appreciate for your quick response!
>>
>> With your statements, I can understand Aurora's philosophy absolutely.
>> But in my case, my service program is up and running there in good state,
>> it seems that Aurora scheduler will kill my service program periodically
>> and move it to another machine. I expect my service program running there
>> forever unless there is a restart/crash etc.
>>
>>
>>
>>
>>
>> On Sat, Jun 25, 2016 at 8:27 PM, Erb, Stephan <
>> stephan@blue-yonder.com
>> > wrote:
>>
>> Hi Zi-Liang,
>>
>>
>>
>> by default, services in Aurora are not pinned to a particular machine.
>> This is based on the philosophy that services should be stateless and thus
>> not dependent on a particular host, if possible.
>>
>>
>>
>> Whenever an instance/task of your service has terminated, the scheduler
>> might pick any other random machine to launch a replacement. There are many
>> reasons why this could happen:
>>
>>
>>
>> · Your instance has crashed, ran out of memory, or simply exited
>> normally.
>>
>> · If enabled, your health checks may have detected that the
>> instance is no longer responding.
>>
>> · The agent machine it was running on failed or lost
>> connectivity with Mesos.
>>
>> · You have used the aurora_admin client to drain a machine.
>>
>> · You used a client command such as restart or update.
>>
>>
>>
>> If necessary, you could use constraints [1] to force Aurora to always
>> schedule a service on the same host. However, this is not really
>> recommended as it can easily lead to situations where your service cannot
>> be launched at all, due to missing resources of he selected host in
>> question.
>>
>>
>>
>> [1]
>> https://github.com/apache/aurora/blob/master/docs/features/constraints.md
>>
>>
>>
>> Best regards,
>>
>> Stephan
>>
>>
>>
>>
>>
>> *From: *Ziliang Chen > >
>> *Reply-To: *"user@aurora.apache.org
>> " <
>> user@aurora.apache.org
>> >
>> *Date: *Saturday 25 June 2016 at 13:08
>> *To: *"user@aurora.apache.org
>> " <
>> user@aurora.apache.org
>> >
>> *Subject: *Prevent service Job moved from one machine to another
>> periodically
>>
>>
>>
>> Hi,
>>
>>
>>
>> I have "service" job scheduled by Aurora. I found periodically, the
>> service job will be moved from one machine to another (stop it on previous
>> machine and restart it on another one). May i ask if this is an expected
>> behavior and if it is, how to make the service job stick to one machine
>> unless there is a failure ?
>>
>>
>>
>> Thank you very much !
>>
>>
>>
>> --
>>
>> Regards, Zi-Liang
>>
>> Mail:zlchen@gmail.com
>> 
>>
>>
>>
>>
>>
>> --
>>
>> Regards, Zi-Liang
>>
>> Mail:zlchen@gmail.com
>> 
>>
>>
>
>
> --
> Regards, Zi-Liang
>
> Mail:zlchen@gmail.com
> 
>


Re: Get active/running instance IDs of a job.

2016-03-27 Thread Bill Farner
You can get that data from the getConfigSummary API call:
https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L957-L958

which populates Result.configSummaryResult:
https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L912

follow the chain a few steps and you'll see these two:

struct ConfigGroup {
  1: TaskConfig config
  3: set<Range> instances
}

struct ConfigSummary {
  1: JobKey key
  2: set<ConfigGroup> groups
}
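For illustration, calling this from Java looks roughly like the following.
This is an untested sketch against the thrift-generated classes; the package
and bean names follow the generated-code conventions, and it assumes
`instances` is a set of Range structs in your version.

import java.util.HashSet;
import java.util.Set;

import org.apache.aurora.gen.ConfigGroup;
import org.apache.aurora.gen.JobKey;
import org.apache.aurora.gen.Range;
import org.apache.aurora.gen.ReadOnlyScheduler;
import org.apache.aurora.gen.Response;
import org.apache.thrift.TException;

final class ActiveInstanceIds {
  static Set<Integer> activeInstanceIds(ReadOnlyScheduler.Client client, JobKey job)
      throws TException {
    Response response = client.getConfigSummary(job);
    Set<Integer> ids = new HashSet<>();
    for (ConfigGroup group
        : response.getResult().getConfigSummaryResult().getSummary().getGroups()) {
      // Each group lists the instance id ranges sharing one TaskConfig; the
      // union across groups is the full set of active instance ids.
      for (Range range : group.getInstances()) {
        for (int i = range.getFirst(); i <= range.getLast(); i++) {
          ids.add(i);
        }
      }
    }
    return ids;
  }
}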

On Sun, Mar 27, 2016 at 12:23 AM, Krish  wrote:

> When we run n instances of a job, we have the instances marked from [0,
> n-1].
> When using the addInstances() API, we also need to specify the active
> instance ID for the job to scale up.
> How can I find a list of active instance IDs? The JobStats structure only
> has the count and not the IDs themselves.
>
> My scenario for testing this was:
> Called the addInstances API for a job with instanceID parameter as 0 =>
> worked fine.
> I used the aurora CLI `aurora job kill ../../../../0` to kill the task with
> instance id 0, and then called addInstances with parameter 0 => errors out.
> In this scenario, I should query the active instanceIDs of the job, and
> pass it as a parameter to addInstances.
>
> Thanks.
>
> --
> κρισhναν
>


Re: Aurora Thrift API Info

2016-03-19 Thread Bill Farner
>
> I like the DSL that Aurora has and I was wondering if there was any way to
> use the API and somehow get the DSL translated in to what the
> ExecutorConfig structure requires?


For an ad-hoc peek at what's in the ExecutorConfig, you can follow the bit
of hacking described in this thread:
http://mail-archives.apache.org/mod_mbox/aurora-dev/201601.mbox/%3CCAC2vyCXEOt1xaPC3DK_o_-J%3Dunjg7csQAajb5Ekad-g%2B20VxHA%40mail.gmail.com%3E

As for doing it via the API, you will find the executor-specific DSL
contents in the response to getTasksStatus
https://github.com/apache/aurora/blob/ec0f38a7528f8adb8f0769112a1043b98598c03f/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L945-L946

that API call will populate Result.scheduleStatusResult:
https://github.com/apache/aurora/blob/ec0f38a7528f8adb8f0769112a1043b98598c03f/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L899

From there the chain will be
ScheduledTask->AssignedTask->TaskConfig->ExecutorConfig

Not an obvious path, but it can be done!
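Concretely, an untested Java sketch of that chain (bean names assumed from the
thrift-generated classes; the client connection is up to you):

import java.util.Collections;

import org.apache.aurora.gen.JobKey;
import org.apache.aurora.gen.ReadOnlyScheduler;
import org.apache.aurora.gen.Response;
import org.apache.aurora.gen.ScheduledTask;
import org.apache.aurora.gen.TaskQuery;
import org.apache.thrift.TException;

final class ExecutorConfigPeek {
  static void printExecutorData(ReadOnlyScheduler.Client client, JobKey job)
      throws TException {
    Response response =
        client.getTasksStatus(new TaskQuery().setJobKeys(Collections.singleton(job)));
    for (ScheduledTask task : response.getResult().getScheduleStatusResult().getTasks()) {
      // ScheduledTask -> AssignedTask -> TaskConfig -> ExecutorConfig; 'data'
      // holds the JSON the client rendered from your .aurora DSL file.
      System.out.println(task.getAssignedTask().getTask().getExecutorConfig().getData());
    }
  }
}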





On Wed, Mar 16, 2016 at 11:54 AM, Rice, Ben <ben.r...@netapp.com> wrote:

> Sorry to piggy back on the thread, but I had a related question.
>
>
>
> I’m wanting to write a GUI for creating/managing/etc jobs.  I like the DSL
> that Aurora has and I was wondering if there was any way to use the API and
> somehow get the DSL translated in to what the ExecutorConfig structure
> requires?
>
>
>
> Thanks,
>
> -Ben
>
>
>
> *From:* Krish [mailto:krishnan.k.i...@gmail.com]
> *Sent:* Wednesday, March 16, 2016 2:36 PM
> *To:* user@aurora.apache.org; Bill Farner <wfar...@apache.org>;
> ma...@apache.org
> *Subject:* Re: Aurora Thrift API Info
>
>
>
> Thanks, Maxim & Bill!
>
>
>
> I would love some more clarifications to the below observations.
>
>
>
> A little googling helped me find
> https://issues.apache.org/jira/browse/AURORA-1258, which then led me to
> http://markmail.org/message/al26gmpwlcns3oer#query:+page:1+mid:2smaej5n5e54li3g+state:results
> .
>
>
>
> Question is when I am modifying job details, in particular, scaling up
> instances based on demand, do I use the startJobUpdate or the addInstances
> API?
>
> Seems like addInstances is supposed to do this, but you mention that
> startJobUpdate is also supposed to be the "main API to change your
> service job in any way (including adding, removing or modifying instances).
> "
>
>
>
> Also, if both are valid, under what scenarios would one use startJobUpdate?
>
> Which one will be non-destructive? As in, which API does not kill current
> instances while adding new ones?
>
>
>
> And if I reduce the instances (for eg, from 6 to 5), will the API
> (addInstances or startJobUpdate) also kill the last instance of the job?
>
>
>
>
>
> --
>
> κρισhναν
>
>
>
> On Wed, Mar 16, 2016 at 10:30 PM, Bill Farner <wfar...@apache.org> wrote:
>
> Regarding documentation - Maxim is correct that there isn't much in the
> way of independent/holistic docs for the thrift API.  There is, however,
> scant javadoc-style documentation within the IDL spec itself:
> https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>
>
>
> If you are looking to use the thrift API directly, the most difficult API
> method will be defining the ExecutorConfig.data value when calling
> createJob.  Please don't hesitate to ask for assistance if you get to that
> point!
>
>
>
> On Wed, Mar 16, 2016 at 9:19 AM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> 1. All APIs require thrift inputs of the structs specified, and return
> thrift values only in Response.result field.
>
> Correct. There is also 'details' field that may have additional messages
> (of error or informational nature)
>
>
>
> 2. Is there a set of examples in the documentation to help understand
> Thrift API better?
>
> The thrift API is largely undocumented. There is an effort to bring up a
> fully supported REST API that will presumably get documented and become
> much easier to use. It's mostly in flux now.
>
>
>
> 3. createJob(JobDescription desc, Lock lock):
>
> This is the API to use when you a brand new service or adhoc (batch) job
> created. The JobDescription is populated from the .aurora config. You may
> want to trace "aurora job create" client command implementation to see how
> it happens.
>
>
>
> 4. What is the Lock object? I see that some APIs require locking and some
> don't. For example, createJob needs a Lock object as parameter, & I am
> assuming that it is required so that one does not create multiple jobs with
> the same JobKey.
>
> Ignore 

Re: Aurora Thrift API Info

2016-03-18 Thread Bill Farner
: 4, Failures: 0, Errors: 0, Skipped: 0, Time
>>>> elapsed: 0.062 sec
>>>> [junit] Running org.apache.thrift.protocol.TestTSimpleJSONProtocol
>>>> [junit] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time
>>>> elapsed: 0.046 sec
>>>>
>>>> BUILD FAILED
>>>> /tmp/thrift-0.9.3/lib/java/build.xml:202: Test
>>>> org.apache.thrift.protocol.TestTSimpleJSONProtocol failed
>>>>
>>>> Total time: 17 seconds
>>>> make[3]: *** [check-local] Error 1
>>>> make[3]: Leaving directory `/tmp/thrift-0.9.3/lib/java'
>>>> make[2]: *** [check-am] Error 2
>>>> make[2]: Leaving directory `/tmp/thrift-0.9.3/lib/java'
>>>> make[1]: *** [check-recursive] Error 1
>>>> make[1]: Leaving directory `/tmp/thrift-0.9.3/lib'
>>>> make: *** [check-recursive] Error 1
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> κρισhναν
>>>>
>>>> On Thu, Mar 17, 2016 at 7:32 PM, Chris Bannister <c.bannis...@gmail.com
>>>> > wrote:
>>>>
>>>>> I've used the latest thrift to generate go code, and then manually
>>>>> created executor config which works and is able to launch jobs.
>>>>>
>>>>> On Thu, 17 Mar 2016, 1:55 p.m. Jake Farrell, <jfarr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Krish
>>>>>> We are using Thrift with go for all our api calls to Aurora, would
>>>>>> recommend you use Thrift 0.9.3 to interact with the api.
>>>>>>
>>>>>> happy to help answer any questions you might have
>>>>>>
>>>>>> -Jake
>>>>>>
>>>>>> On Thu, Mar 17, 2016 at 9:43 AM, Krish <krishnan.k.i...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Bill.
>>>>>>>
>>>>>>> Well I have started my foray into the the thrift API today. And I
>>>>>>> think I am stuck with some thrift configs.
>>>>>>>
>>>>>>> Does it matter if I use thrift v0.9.0 on the client side to talk
>>>>>>> with aurora using thrift 0.9.1? Are they compatible? I couldn't find any
>>>>>>> changelog or compatibility statement on the thrift project site.
>>>>>>>
>>>>>>>
>>>>>>> Since Aurora v0.12 uses thrift version 0.9.1, and the debian repos
>>>>>>> have 0.9.0, I had to compile the thrift compiler v0.9.1 from source.
>>>>>>> However, when I try to generate golang code, I think I hit a compiler 
>>>>>>> bug:
>>>>>>> krish@krish:/tmp
>>>>>>> > thrift --gen go api.thrift
>>>>>>> ./gen-go//api/ttypes.go:2623:6: missing ',' in composite literal
>>>>>>> ./gen-go//api/ttypes.go:2624:19: expected '==', found '='
>>>>>>> WARNING - Running 'gofmt -w ./gen-go//api/ttypes.go' failed.
>>>>>>>
>>>>>>> I can modify the golang code by hand, but I would like to play it
>>>>>>> safe and use the working compiler from the debian repos.
>>>>>>>
>>>>>>> Also, when I use thrift v0.9.0, and try to integrate code into a
>>>>>>> test golang app, it fails to find "thriftlib/api" package. Anyone faced 
>>>>>>> a
>>>>>>> similar error and gone past it?
>>>>>>> I have already done a `go get
>>>>>>> git.apache.org/thrift.git/lib/go/thrift/...`
>>>>>>> <http://git.apache.org/thrift.git/lib/go/thrift/...>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> κρισhναν
>>>>>>>
>>>>>>> On Wed, Mar 16, 2016 at 10:30 PM, Bill Farner <wfar...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Regarding documentation - Maxim is correct that there isn't much in
>>>>>>>> the way of independent/holistic docs for the thrift API.  There is,
>>>>>>>> however, scant javadoc-style documentation within the IDL spec itself:
>>>>>>>> https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.t

Re: Announcer problem

2016-01-25 Thread Bill Farner
Continuing the arg removal discussion on a patch:
https://reviews.apache.org/r/42727/

On Mon, Jan 25, 2016 at 9:36 AM, Maxim Khutornenko <ma...@apache.org> wrote:

> On that topic, does anyone else think --announcer-enable is redundant?
>
>
> +1. I think this is the case where a single flag would suffice.
>
> On Mon, Jan 25, 2016 at 9:28 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> There's also 2 flags you need to pass to the executor via the scheduler:
>> --announcer-enable, --announcer-ensemble.  See here for example:
>> https://github.com/apache/aurora/blob/master/examples/vagrant/upstart/aurora-scheduler.conf#L43
>>
>> On that topic, does anyone else think --announcer-enable is redundant?
>>
>>
>> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
>> app.add_option(
>>     '--announcer-enable',
>>     dest='announcer_enable',
>>     action='store_true',
>>     default=False,
>>     help='Enable the ServerSet announcer for this executor.  Jobs must still activate using '
>>          'the Announcer configuration.')
>>
>> app.add_option(
>>     '--announcer-ensemble',
>>     dest='announcer_ensemble',
>>     type=str,
>>     default=None,
>>     help='The ensemble to which the Announcer should register ServerSets.')
>>
>> Even the error message and handling of these args suggests redundancy:
>>
>>   if options.announcer_enable:
>>     if options.announcer_ensemble is None:
>>       app.error('Must specify --announcer-ensemble if the announcer is enabled.')
>>     status_providers.append(DefaultAnnouncerCheckerProvider(
>>         options.announcer_ensemble,
>>         options.announcer_serverset_path,
>>         options.announcer_allow_custom_serverset_path
>>     ))
>>
>> Seems like we should enable the announcer iff announcer_ensemble is set.
>>
>> On Mon, Jan 25, 2016 at 8:08 AM, 卢义 <masayoshi...@icloud.com> wrote:
>>
>>> Hi,
>>>
>>> I am using aurora 0.11 (installed from deb package) with mesos 0.26 on
>>> ubuntu 14.04.3.
>>>
>>> My job file:
>>> scheduler_proc = Process(
>>> name="kafka_mesos_scheduler_process",
>>> cmdline="""
>>> cd /usr/local/kafka-mesos
>>> rm kafka-mesos.properties
>>> touch kafka-mesos.properties
>>> echo 'user=root' | tee -a kafka-mesos.properties
>>> echo 'storage=zk:/mesos-kafka-scheduler' | tee -a kafka-mesos.properties
>>> echo 'master=zk://ourtmx01:2181,ourtmx02:2181,ourtmx05:2181/mesos' |
>>> tee -a kafka-mesos.properties
>>> echo 'zk=myzkenpoints/kafka02' | tee -a kafka-mesos.properties
>>> echo 'api=http://0.0.0.0:{{thermos.ports[http]}}' | tee -a
>>> kafka-mesos.properties
>>> cat kafka-mesos.properties
>>> ./kafka-mesos.sh scheduler
>>> """)
>>>
>>> scheduler_task = Task(
>>>   name = 'run_scheduler',
>>>   processes = [scheduler_proc],
>>>   resources = Resources(cpu = 0.5, ram = 512*MB, disk=128*MB))
>>>
>>> jobs = [
>>>   Service(cluster = 'mycluster',
>>>   environment = 'prod',
>>>   role = 'root',
>>>   name = 'kafka-mesos',
>>>   task = scheduler_task,
>>>   announce = Announcer(),
>>>   container = Container(docker = Docker(image =
>>> 'myregistryserver:5000/kafka-mesos-scheduler:0.9')))]
>>>
>>> The job was running well, but I didn't find any ServerSets added to my
>>> ZK. There are only scheduler and replicated-log in /aurora.
>>>
>>>
>>>
>>>
>>
>


Re: Pre-checking if job can be scheduled?

2016-01-12 Thread Bill Farner
I think that would be a cool addition to the API, and relatively easy to
implement.  I'd be happy to shepherd if you are willing to take a crack at
a patch!

On Tue, Jan 12, 2016 at 2:56 PM, Brian Hatfield 
wrote:

> Hi,
>
> We currently run a (relatively) small Mesos/Aurora cluster, and don't
> always have significant resource overhead available.
>
> Sometimes, we go to schedule a job and we're just short of what we
> estimated-by-hand we'd need in the cluster for it. Most of the tasks
> schedule - but a few stay "PENDING" because of the resource constraint.
> This often confuses users, or in some cases, causes the command to block
> for a while until it eventually times out.
>
> We're currently working on automating somewhat-more-precise basic
> estimation with information sourced from /offers to get a sense of "nope,
> your task won't schedule" to provide fast feedback that doesn't manipulate
> the state of the cluster.
>
> A friend recommended that I suggest to this mailing list something
> integrated into Aurora to accomplish this instead - since our basic
> estimation doesn't include co-scheduling constraints, quotas, etc.
>
> So: We believe that this feature doesn't exist in Aurora today, and wanted
> to suggest it as a future feature for the project.
>
> Thanks :-)
> Brian
>


Re: Pre-checking if job can be scheduled?

2016-01-12 Thread Bill Farner
Quick pointers for after you read the contributing doc:

1. Skim the doc for developing on the scheduler
https://github.com/apache/aurora/blob/master/docs/developing-aurora-scheduler.md

2. Add the new API method
https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L953

3. Implement the API method
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ReadOnlySchedulerImpl.java#L113

4. To answer the question you're asking, you need guice-injected
OfferManager:
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java#L113
and
SchedulingFilter:
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/filter/SchedulingFilter.java#L326-L334
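To make (4) concrete, the core of the new method could look very roughly like
the following. This is a pure sketch with stand-in interfaces; nothing like it
exists in Aurora today, and a real patch would use the injected OfferManager
and SchedulingFilter linked above.

import java.util.Set;

// Pure sketch of the proposed read-only "would this task fit?" check. The
// Offers and Filter interfaces below are invented stand-ins for the real
// OfferManager and SchedulingFilter.
final class SchedulabilityCheck {
  interface Offers {
    Iterable<Resources> outstanding();
  }

  interface Filter {
    // An empty set of vetoes means the task fits within the given resources.
    Set<String> vetoes(Resources available, Resources requested);
  }

  private final Offers offers;
  private final Filter filter;

  SchedulabilityCheck(Offers offers, Filter filter) {
    this.offers = offers;
    this.filter = filter;
  }

  /** True iff at least one outstanding offer could host the requested task. */
  boolean couldSchedule(Resources requested) {
    for (Resources available : offers.outstanding()) {
      if (filter.vetoes(available, requested).isEmpty()) {
        return true;
      }
    }
    return false;
  }

  static final class Resources {
    final double cpus;
    final long ramMb;
    final long diskMb;

    Resources(double cpus, long ramMb, long diskMb) {
      this.cpus = cpus;
      this.ramMb = ramMb;
      this.diskMb = diskMb;
    }
  }
}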






On Tue, Jan 12, 2016 at 6:53 PM, Brian Hatfield 
wrote:

> Wow!
>
> Thanks for the positive feedback and fast responses!
>
> @john/bill - Yes, I'd be happy to do at a minimum [1], and I am willing to
> do [2] but am currently completely unfamiliar with the codebase. I'll read
> the contributing docs and pull down the code and see if I can figure out a
> guess of a way forward, and then report in if I think I can do it.
>
> Thanks!
> Brian
>
> On Tue, Jan 12, 2016 at 6:22 PM, Andrew Jorgensen <
> and...@andrewjorgensen.com> wrote:
>
>> One other case to take into account which complicates the logic a bit is
>> we have some jobs that need to be stopped and then started again usually
>> with either code changes or capacity increases. In this case we would
>> need to have the resources already consumed for the job factored back in
>> to determine whether there is enough room to run the job. I think for a
>> first pass a simple yes/no on outstanding offers would be good but for
>> our use case we would need to supply an existing job as an argument to
>> tell the offers check to add those resources back when considering
>> whether there is enough room or not.
>>
>> This can get a bit race conditiony if you have multiple people starting
>> and stopping jobs in the cluster. It may also be interesting to have an
>> addition to the deploy task that says something like "if you can deploy
>> this do it if not then don't do anything and exit with an error" or
>> something like that. I'm not sure what guarantees you can make between
>> the check and the actual deploy based on other things that are going on
>> in the cluster but that would definitely be an awesome improvement for
>> that use case.
>>
>> --
>> Andrew Jorgensen
>> @ajorgensen
>>
>> On Tue, Jan 12, 2016, at 06:14 PM, John Sirois wrote:
>> > On Tue, Jan 12, 2016 at 3:56 PM, Brian Hatfield 
>> > wrote:
>> >
>> > > Hi,
>> > >
>> > > We currently run a (relatively) small Mesos/Aurora cluster, and don't
>> > > always have significant resource overhead available.
>> > >
>> > > Sometimes, we go to schedule a job and we're just short of what we
>> > > estimated-by-hand we'd need in the cluster for it. Most of the tasks
>> > > schedule - but a few stay "PENDING" because of the resource
>> constraint.
>> > > This often confuses users, or in some cases, causes the command to
>> block
>> > > for a while until it eventually times out.
>> > >
>> > > We're currently working on automating somewhat-more-precise basic
>> > > estimation with information sourced from /offers to get a sense of
>> "nope,
>> > > your task won't schedule" to provide fast feedback that doesn't
>> manipulate
>> > > the state of the cluster.
>> > >
>> > > A friend recommended that I suggest to this mailing list something
>> > > integrated into Aurora to accomplish this instead - since our basic
>> > > estimation doesn't include co-scheduling constraints, quotas, etc.
>> > >
>> > > So: We believe that this feature doesn't exist in Aurora today, and
>> wanted
>> > > to suggest it as a future feature for the project.
>> > >
>> >
>> > I think this would be a great feature from simple yes/no to more
>> > sophisticated likelyhood estimates even based on time of day (cron job
>> > scheduling taken into account):
>> > 1. A ticket [1] describing the minimum viable feature.
>> > 2. Work towards implementation [2].
>> >
>> > Would you be willing to do any of these? I'd be willing to review
>> designs
>> > and reviews.
>> >
>> > [1] https://issues.apache.org/jira/secure/CreateIssue!default.jspa
>> > [2] http://aurora.apache.org/documentation/latest/contributing/
>> >
>> >
>> > > Thanks :-)
>> > > Brian
>> > >
>>
>
>


[ANNOUNCE] Apache Aurora 0.11.0 debian packages

2016-01-10 Thread Bill Farner
I'm pleased to announce that official debian packages for Aurora are now
available!

You can find the files here:
https://bintray.com/apache/aurora/debian-ubuntu-trusty/0.11.0


Cheers,

Bill


[ANNOUNCE] 0.11.0 release

2015-12-23 Thread Bill Farner
Hello folks,

I'm pleased to announce that Apache Aurora 0.11.0 has been released!

More details can be found in the blog post:
https://aurora.apache.org/blog/aurora-0-11-0-released/

Thanks to the many people who made this release possible!


Cheers,

Bill