aurora "environment" flexibility

2017-10-02 Thread Mohit Jaggi
Folks, Why is the environment limited to a few strings? Is there a way I can customize this? Mohit. Error loading configuration: Environment should be one of "prod", "devel", "test" or staging! Got us1.production

Re: fix for aurora-1945

2017-10-02 Thread Mohit Jaggi
this 'global ban' routine > was intending to plug. > > On Mon, Oct 2, 2017 at 4:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> I was wondering if that case can be checked and the banning skipped. Is >> there a place where we store "used" o

Re: fix for aurora-1945

2017-10-02 Thread Mohit Jaggi
t useless > > > That is effectively correct. > > On Mon, Oct 2, 2017 at 5:10 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> True. I also see that the pool of threads for processing offers has a >> size of 1. Am I reading that wrong? Will a larger pool increase >>

leader election issues

2017-09-26 Thread Mohit Jaggi
Fellows, While examining Aurora log files, I noticed a condition where a scheduler was talking to Mesos but it was not showing up as a leader in Zookeeper. It ultimately restarted itself and another scheduler became the leader. Is there a reason a non-leading scheduler will talk to Mesos? Mohit.

Re: leader election issues

2017-09-26 Thread Mohit Jaggi
, such as two aurora schedulers believing they are leaders. If > you could tell us a little bit about your ZK set up we might be able to > narrow down the issue. Also, Aurora version and whether you are using > Curator or the commons library will help as well. > > On Tue, Sep 26, 2017

Re: leader election issues

2017-09-26 Thread Mohit Jaggi
> Is there a reason a non-leading scheduler will talk to Mesos > > > No, there is not a legitimate reason. Did this occur for an extended > period of time? Do you have logs from the scheduler indicating that it > lost ZK leadership and subsequently interacted with mesos? > > On Tue, Sep 26, 2017 a

Re: leader election issues

2017-09-26 Thread Mohit Jaggi
aurora-scheduler[24743]: E0926 18:21:37.205 [Lifecycle-0, SchedulerLifecycle$4:235] Framework has not been registered within the tolerated delay. On Tue, Sep 26, 2017 at 2:34 PM, John Sirois <john.sir...@gmail.com> wrote: > > > On Tue, Sep 26, 2017 at 3:31 PM, Mohit Jaggi <moh

Re: leader election issues

2017-09-26 Thread Mohit Jaggi
ecuted against the newly > connected master. > You need to be careful about what you derive from the logs just based on a > reading of the words. Generally you'll need to look carefully / grep > sourcecode to be sure you are mentally modelling the code flows correctly. > It certainly g

Re: Lost framework registered event [Was Re: leader election issues]

2017-09-28 Thread Mohit Jaggi
re > complete logs? In particular, logs during the 10 minute delay would be > particularly helpful. > > On Tue, Sep 26, 2017 at 11:51 PM, Mohit Jaggi <mohit.ja...@uber.com> > wrote: > >> Updating subject...as it looks like leader election was fine but >> registrat

Re: aurora crash in PendingTaskProcessor

2017-09-29 Thread Mohit Jaggi
specific bug in the suspect code > (OfferManager.java), but it does stand out as subject to races. > Specifically, there is a lack of synchronization when checking whether an offer > exists for a given agent ID and subsequently removing that offer. > > Can you file a bug? > > On Th

Re: aurora crash in PendingTaskProcessor

2017-09-29 Thread Mohit Jaggi
This exhibits a classic check-then-act race on hostOffers, which could > allow a second offer with the same agent ID. An obvious fix here would be > to move the "if exists, remove, else add" sequence in a synchronized method > in hostOffers. > > Happy to help guide you on
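
The fix suggested in this message can be illustrated with a minimal sketch. It assumes a simple map keyed by agent ID; `HostOffersSketch` and `addOrEvict` are hypothetical names chosen for illustration, not Aurora's actual OfferManager code. The point is only that the existence check and the follow-up action run under one lock.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the offer bookkeeping discussed in the thread.
// None of these names come from Aurora itself.
final class HostOffersSketch {
  private final Map<String, String> offerByAgent = new HashMap<>();

  // The whole "if exists, remove, else add" sequence is one synchronized
  // method, so no other thread can slip in between the check and the act.
  // Returns true if the offer was added, false if an existing one was removed.
  synchronized boolean addOrEvict(String agentId, String offerId) {
    if (offerByAgent.containsKey(agentId)) {
      offerByAgent.remove(agentId);
      return false;
    }
    offerByAgent.put(agentId, offerId);
    return true;
  }

  synchronized int size() {
    return offerByAgent.size();
  }

  public static void main(String[] args) throws InterruptedException {
    HostOffersSketch offers = new HostOffersSketch();
    Thread[] workers = new Thread[8];
    for (int i = 0; i < workers.length; i++) {
      final String offerId = "offer-" + i;
      // Every thread reports an offer for the same agent.
      workers[i] = new Thread(() -> offers.addOrEvict("agent-1", offerId));
      workers[i].start();
    }
    for (Thread worker : workers) {
      worker.join();
    }
    // Calls alternate between add and remove under the lock, so an even
    // number of calls for one agent leaves no offer behind.
    System.out.println("offers held for agent-1: " + offers.size());
  }
}
```

Without `synchronized`, two threads could both see "no offer for this agent" and both add one, which is the duplicate-offer condition described in the message.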

Re: Aurora pauses adding offers

2017-11-27 Thread Mohit Jaggi
r more detail about how Aurora is being used in this regard? > I haven't seen use cases in the past that would be amenable to this > behavior, so i would like to understand better. > > > On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> > wrote: > >> Thanks

Re: Aurora pauses adding offers

2017-11-27 Thread Mohit Jaggi
conciliation logic into it. > > On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> > wrote: > >> Imagine something like Spinnaker using Aurora underneath to schedule >> services. That layer often "amplifies" human effort and may resul

Re: Aurora pauses adding offers

2017-11-28 Thread Mohit Jaggi
g bottlenecks. > > On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> I think more explicit signaling is better. Increased latency can be due >> to other conditions like network issues etc. Right now our mitigation >> involves loa

Re: Apache Aurora holding resources which makes other framework starve

2017-11-25 Thread Mohit Jaggi
Command line params on Aurora and Mesos control this. The "config file" for this may depend on how your cluster is managed. It can be in puppet manifest, for example. See below for the parameters. Docs are http://mesos.apache.org/documentation/latest/configuration/master/ and

Re: HTTP API examples

2017-11-28 Thread Mohit Jaggi
pibeta, just be aware that issues you encounter may not be fixed. That > said, it has been in place for ~3 years and would probably not be removed > unless it impedes other work, or a superior replacement is introduced. > > On Mon, Nov 27, 2017 at 11:39 AM, Mohit Jaggi <mohit.ja...@ub

Re: Aurora pauses adding offers

2017-11-29 Thread Mohit Jaggi
ls and/or introspect arguments, you > would be better off binding a layer for AuroraAdmin.Iface > <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30> > . > > On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mo

Re: Aurora pauses adding offers

2017-11-29 Thread Mohit Jaggi
> > Assuming this means 5 quorum members - no, that should not be a problem. > > If any of the above became an issue for the scheduler, it should certainly > manifest in logs. > > > On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> Thanks Bill. I

Re: sliding stats testing

2017-12-04 Thread Mohit Jaggi
or the purposes of this test, it may > be easiest to integrate with TimeSeriesRepositoryImpl and manually induce > sampling. > > On Sat, Dec 2, 2017 at 12:17 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> Folks, >> I am trying to write a test case and

reverting logback dependency update

2017-11-20 Thread Mohit Jaggi
Folks, Due to a conflict with another tool we use, I can't use logback 1.2.3 and slf4j 1.7.25 yet. Is it safe to change them to the previous values? Ref: https://github.com/apache/aurora/commit/d7425aa56d3fba98f4a16cb93bff8f9ce7ce0e67 Mohit.

Re: reverting logback dependency update

2017-11-20 Thread Mohit Jaggi
hink it is fair to the community or practical to hold back library > versions because of conflicts in proprietary custom builds of Aurora. So in > general, i am -1 on the precedent this would set. > >> On Mon, Nov 20, 2017 at 5:53 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: >>

Re: reverting logback dependency update

2017-11-20 Thread Mohit Jaggi
thanks :) On Mon, Nov 20, 2017 at 7:18 PM, Bill Farner <wfar...@apache.org> wrote: > Aha. Yes, i suspect you will be fine to revert these locally. > > On Mon, Nov 20, 2017 at 7:11 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> I should have been clear. I m

Re: Aurora pauses adding offers

2017-11-10 Thread Mohit Jaggi
he dynamic reservation work done by Dmitri. We also have commits for offer/rescind race issue, setrootfs patch (which is not upstreamed yet). - - - - - I have cherrypicked the fix for Aurora-1952 as well. > On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: >

Re: Aurora pauses adding offers

2017-11-10 Thread Mohit Jaggi
Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-) On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > Thanks Bill. Please see inline: > > On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org>

Re: Aurora pauses adding offers

2017-11-10 Thread Mohit Jaggi
and in log_storage_write_lock_wait_ns_per_event On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that > cause or effect? :-) > > On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.

Re: [ANNOUNCE] 0.18.1 release

2017-11-02 Thread Mohit Jaggi
Kewl! On Wed, Nov 1, 2017 at 11:41 AM, Bill Farner wrote: > Hello folks, > > I'm pleased to announce that Apache Aurora 0.18.1 has been released! > > More details can be found in the blog post: https://aurora.apache.org/ > blog/aurora-0-18-1-released/ > > > Cheers, > > Bill

Aurora pauses adding offers

2017-11-09 Thread Mohit Jaggi
Folks, I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail the logs I can see offers are coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not

Re: Aurora pauses adding offers

2017-11-09 Thread Mohit Jaggi
I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers? On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > Folks, > I have notic

Re: distinguishing failure types during upgrade

2017-11-01 Thread Mohit Jaggi
service has alerts > firing. > > > > On Tue, Oct 31, 2017 at 1:14 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> Folks, >> Sometimes in our cluster upgrades start failing due to transient outages >> of dependencies or reasons unrelated to the new co

Re: shutdown vs kill API is Mesos

2017-12-09 Thread Mohit Jaggi
a could not use the "new" API because of >>> performance issues in the implementation, but i do not know where that >>> stands today. >>> >>> https://mesos.apache.org/documentation/latest/scheduler-http >>> -api/#shutdown >>> >>>

updateconfig doc

2017-10-30 Thread Mohit Jaggi
Folks, Does the following doc mean A or B? *A*: batch_size is the number of instances in a given shard *B*: batch_size is the number of shards. So every batch has (number of instances)/(batch_size) tasks. Mohit. UpdateConfig Objects Parameters for controlling the rate and policy of rolling

distinguishing failure types during upgrade

2017-10-31 Thread Mohit Jaggi
Folks, Sometimes in our cluster upgrades start failing due to transient outages of dependencies or reasons unrelated to the new code being pushed out. Aurora hits its failure threshold and starts automatic rollback which may make a bad condition worse (e.g. if the outage was related to load

Re: updateconfig doc

2017-10-30 Thread Mohit Jaggi
s across the >> instances of the service. >> >> i.e. if batch_size is 3, the updater will start updating 3 instances >> immediately, and proceed through all instances with 3 instances updating at >> any time until it reaches the end. >> >> Does that clarify?
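
The batch_size behaviour described in this reply can be pictured with a toy model. This is a sketch of the description only, not Aurora's updater: `BatchSizeSketch` and `simulate` are hypothetical names, and it assumes the oldest in-flight instance always finishes first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the reply's description of batch_size. Hypothetical code,
// not taken from Aurora.
final class BatchSizeSketch {

  // Returns a snapshot of the in-flight instance IDs taken each time the
  // window has been topped up, until every instance has been updated.
  static List<List<Integer>> simulate(int instanceCount, int batchSize) {
    if (batchSize < 1) {
      throw new IllegalArgumentException("batchSize must be at least 1");
    }
    Deque<Integer> inFlight = new ArrayDeque<>();
    List<List<Integer>> snapshots = new ArrayList<>();
    int next = 0;
    while (next < instanceCount || !inFlight.isEmpty()) {
      // Start instances until batchSize are updating or none remain.
      while (inFlight.size() < batchSize && next < instanceCount) {
        inFlight.addLast(next++);
      }
      snapshots.add(new ArrayList<>(inFlight));
      // Assumption: the instance that started first completes first.
      inFlight.removeFirst();
    }
    return snapshots;
  }

  public static void main(String[] args) {
    // Five instances with batch_size 3: prints one window per line.
    for (List<Integer> window : simulate(5, 3)) {
      System.out.println(window);
    }
  }
}
```

In this model batch_size caps how many instances are updating at once; it does not split the instances into a fixed number of shards.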

Re: orphaned thermos

2017-10-30 Thread Mohit Jaggi
*Friday, 27. October 2017 at 05:34 > *To: *"user@aurora.apache.org" <user@aurora.apache.org> > *Subject: *Re: orphaned thermos > > > > If the executor runs out of memory, i think it should be assumed that it > will no longer be well-behaved. It seems most sensib

orphaned thermos

2017-10-26 Thread Mohit Jaggi
We found several zombie executors on a cluster. Thermos logs indicate reaching system limits while trying to shutdown(?). Mesos agent is unable to get status of this container from docker daemon (docker inspect fails). Shouldn't thermos exit in such a case? 22 WARNING: Your kernel does not

Re: Lost framework registered event [Was Re: leader election issues]

2017-10-27 Thread Mohit Jaggi
that we now have a good handle on the culprit! More > details at https://issues.apache.org/jira/browse/AURORA-1953 > > On Thu, Sep 28, 2017 at 2:14 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> Hmm...it is a very busy cluster and 10 mins of logs will be voluminou

executor id from task id?

2017-12-23 Thread Mohit Jaggi
Folks, I am trying to work on this: https://issues.apache.org/jira/browse/AURORA-1960 In VersionedSchedulerDriver

Re: "Error accessing PooledConnection. Connection is invalid.

2017-12-23 Thread Mohit Jaggi
ackend for its internal > storage. Besides eliminating the mentioned error, it should also lead to > significant performance improvements. > > > > Best regards, > > Stephan > > > > *From: *Mohit Jaggi <mohit.ja...@uber.com> > *Reply-To: *"u

Re: shutdown vs kill API is Mesos

2018-01-09 Thread Mohit Jaggi
he "new" API because of >> performance issues in the implementation, but i do not know where that >> stands today. >> >> https://mesos.apache.org/documentation/latest/scheduler- >> http-api/#shutdown >> >>> NOTE: This is a new call that was not pres

Re: shutdown vs kill API is Mesos

2018-01-16 Thread Mohit Jaggi
We still need "Agent ID" for the shutdown call. On Tue, Jan 16, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > Sounds good. But do we really need the command line option? One can use an > older Driver if KILL is preferred for some reason. > > On Tue, J

Re: shutdown vs kill API is Mesos

2018-01-16 Thread Mohit Jaggi
> Does anyone see an issue with this approach? > > On Tue, Jan 16, 2018 at 11:15 AM, Mohit Jaggi <mohit.ja...@uber.com> > wrote: > >> To do this in a backward compatible manner, one way is : >> >> ``` >> void destroy(taskId, executorId, agentId) { >>

Re: shutdown vs kill API is Mesos

2018-01-16 Thread Mohit Jaggi
n? > > > Aurora can run tasks without an executor. I'm assuming the shutdown call > is incompatible with that mode. > > On Tue, Jan 16, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> We still need "Agent ID" for the shutdown call. >

Re: shutdown vs kill API is Mesos

2018-01-17 Thread Mohit Jaggi
es it already handle that? On Tue, Jan 16, 2018 at 4:48 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > So that is pretty much what I proposed... > > If the method signature has to change, we can keep the executorId as it > is, unless we want to take this opportunity to clean t

Re: shutdown vs kill API is Mesos

2018-01-12 Thread Mohit Jaggi
Summary so far: - Bill supports making this change - This change cannot be made in a backward compatible manner - David (Twitter) does not want to use HTTP APIs due to performance concerns. I conclude that folks from Twitter don't support this change Question: - Are there other users that want

Re: shutdown vs kill API is Mesos

2018-01-11 Thread Mohit Jaggi
sure about the Shutdown call, as you mentioned, the versioned >>> driver seems to have the method but the driver interface does not. This >>> might get tricky from here on in since Mesos has V1 only compatible calls. >>> >>> On Thu, Jan 11, 2018 at 1:24

Re: shutdown vs kill API is Mesos

2018-01-11 Thread Mohit Jaggi
dSchedulerDriverService.java#L50 > > On Tue, Jan 9, 2018 at 1:21 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > >> David, >> Where can I find this code? >> >> Mohit. >> >> On Sat, Dec 9, 2017 at 4:27 PM, David McLaughlin <dmclaugh...@apache.org>

Re: kill task for unknown task id

2018-02-02 Thread Mohit Jaggi
them. Could be because > of a race between aurora and mesos and crashes involved. Just a guess. > > Thx > > On Feb 1, 2018, at 4:45 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > > Any idea folks? > > On Tue, Jan 23, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com>

Re: testing SHUTDOWN call with V1Mesos, native library missing

2018-02-01 Thread Mohit Jaggi
Appreciate any pointers to fix this. On Tue, Jan 23, 2018 at 1:37 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > Folks, > I am adding a test case for testing the call to Mesos SHUTDOWN. For that I > replaced Mesos with V1Mesos in VersionedSchedulerDriverServiceTest.java. > I

Re: kill task for unknown task id

2018-02-01 Thread Mohit Jaggi
Any idea folks? On Tue, Jan 23, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote: > Folks, > While changing code to use Mesos's SHUTDOWN call instead of KILL, I see > that there is a unit test ( > >- StateManagerImplTest >

tiers.info question

2018-02-05 Thread Mohit Jaggi
Folks, What are the implications of changing tiers.json on the replicated log? If a tier is removed, for example, will the Aurora code fail to read the replicated log on scheduler restart? Mohit.

"Error accessing PooledConnection. Connection is invalid.

2017-12-22 Thread Mohit Jaggi
Folks, While running load tests on Aurora I see the following. What does it mean? This is the new code I added (we discussed this a few weeks ago) to decline incoming update API calls when Aurora is slowing down (not triggered in this case as this is a get call). The