FYI, I had a quick chat with Vinod from the Mesos team. I have some questions for Aurora users inline:

*Originally the default was the COMMAND executor. In this world the scheduler has no visibility into the command executor.*
*More recently, we added a DEFAULT executor which is used by frameworks when they want to launch pod like task groups*
*The SHUTDOWN executor call is only applicable if a scheduler uses CUSTOM or DEFAULT executor *and* uses v1 scheduler API.*

Q1: Does Aurora use COMMAND or DEFAULT executor?

*note that SHUTDOWN is not as robust as you might think :)*
*for one, there is no reconciliation API for the executor state. it is very much best effort.*
*KILL is more robust for killing tasks, because task status updates are reliably delivered and there is reconciliation API*

Q2: I think that this is ok as Aurora's reconciliation will still work as we don't have "executor state". "task state" will be a good and correct proxy for that. Aurora will send SHUTDOWN again and again until it succeeds in the same way as it does now with KILL. Right?

Q3: Does thermos executor need any changes to respond to SHUTDOWN or does it already handle that?

On Tue, Jan 16, 2018 at 4:48 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> So that is pretty much what I proposed...
>
> If the method signature has to change, we can keep the executorId as it
> is, unless we want to take this opportunity to clean that up. I will check
> if the SHUTDOWN works in non-executor cases also.
>
> On Tue, Jan 16, 2018 at 3:03 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> We still need "Agent ID" for the shutdown call.
>>
>>
>> Darn. In that case, how about we change the method signature in Driver
>> to accept agentId and ignore that param in MesosSchedulerDriver.
>>
>> But do we really need the command line option?
>>
>>
>> Aurora can run tasks without an executor. I'm assuming the shutdown call
>> is incompatible with that mode.
>>
>> On Tue, Jan 16, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> We still need "Agent ID" for the shutdown call.
>>>
>>> On Tue, Jan 16, 2018 at 1:57 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>> wrote:
>>>
>>>> Sounds good. But do we really need the command line option? One can use
>>>> an older Driver if KILL is preferred for some reason.
>>>>
>>>> On Tue, Jan 16, 2018 at 1:51 PM, Bill Farner <wfar...@apache.org>
>>>> wrote:
>>>>
>>>>> This situation is much simpler if task ID == executor ID. I can't
>>>>> come up with a good reason why this is not the case today. Our executor
>>>>> IDs originally included static prefix, though i do not recall any
>>>>> rationale for this. When Renan added custom executor support, this
>>>>> static prefix was made configurable. Again, i do not believe there was
>>>>> any rationale for the utility of executor IDs.
>>>>>
>>>>> I propose the following:
>>>>> - Change relevant code in MesosTaskFactory to
>>>>>   setExecutorId(task.getTaskId())
>>>>> - Add a command line parameter (default false) to toggle use of
>>>>>   executor shutdown in VersionedSchedulerDriverService.killTask
>>>>>
>>>>> Does anyone see an issue with this approach?
>>>>>
>>>>> On Tue, Jan 16, 2018 at 11:15 AM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> To do this in a backward compatible manner, one way is :
>>>>>>
>>>>>> ```
>>>>>> void destroy(taskId, executorId, agentId) {
>>>>>>
>>>>>>   if(driver instanceOf Versioned....)
>>>>>>     (Versioned...)driver.shutdown(executorId, agentId)
>>>>>>   else
>>>>>>     driver.kill(taskId)
>>>>>>
>>>>>> }
>>>>>> ```
>>>>>>
>>>>>> Any other opinions?
>>>>>>
>>>>>> On Tue, Jan 16, 2018 at 11:12 AM, David McLaughlin
>>>>>> <dmclaugh...@apache.org> wrote:
>>>>>>
>>>>>>> Nope, I support getting SHUTDOWN in for users of the new API.
>>>>>>>
>>>>>>> On Tue, Jan 16, 2018 at 11:06 AM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Are you suggesting that we delay the switch to SHUTDOWN call until
>>>>>>>> this working group can resolve the API perf issue?
>>>>>>>>
>>>>>>>> On Mon, Jan 15, 2018 at 3:55 PM, David McLaughlin
>>>>>>>> <dmclaugh...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> We are working with Mesos folks to resolve it. There is a Mesos
>>>>>>>>> performance working group that folks can join if they'd like to
>>>>>>>>> contribute:
>>>>>>>>> http://mesos.apache.org/blog/performance-working-group-progress-report/
>>>>>>>>>
>>>>>>>>> I'm not sure what you mean by branch. Everything we used to scale
>>>>>>>>> test is on master.
>>>>>>>>>
>>>>>>>>> On Mon, Jan 15, 2018 at 10:08 AM, Meghdoot bhattacharya
>>>>>>>>> <meghdoo...@yahoo.com> wrote:
>>>>>>>>>
>>>>>>>>>> David, should twitter try against mesos 1.5 to see if things are
>>>>>>>>>> better with the new api instead of libmesos. This is going to be a
>>>>>>>>>> drift over time that will stop us from adopting new features.
>>>>>>>>>>
>>>>>>>>>> If it was sometime back it would be good to rerun the tests and
>>>>>>>>>> open a ticket in Mesos if issues exist. All aurora users can then
>>>>>>>>>> push for resolution.
>>>>>>>>>>
>>>>>>>>>> Also details on branch etc that has the api integration?
>>>>>>>>>>
>>>>>>>>>> Thx
>>>>>>>>>>
>>>>>>>>>> On Jan 12, 2018, at 11:39 AM, David McLaughlin
>>>>>>>>>> <dmclaugh...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> I'm not sure I agree with the summary. Bill's proposal was using
>>>>>>>>>> shutdown only when using the new API. I would also support this if
>>>>>>>>>> it's possible.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 12, 2018 at 11:14 AM, Mohit Jaggi
>>>>>>>>>> <mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Summary so far:
>>>>>>>>>>> - Bill supports making this change
>>>>>>>>>>> - This change cannot be made in a backward compatible manner
>>>>>>>>>>> - David (Twitter) does not want to use HTTP APIs due to
>>>>>>>>>>> performance concerns.
>>>>>>>>>>> I conclude that folks from Twitter don't support this change
>>>>>>>>>>>
>>>>>>>>>>> Question:
>>>>>>>>>>> - Are there other users that want this change?
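
For reference, here is a minimal Java sketch of the backward-compatible dispatch from the `destroy(taskId, executorId, agentId)` pseudocode quoted above. Every name in it (`DestroySketch`, `Driver`, `ShutdownCapableDriver`, `killTask`, `shutdownExecutor`, the two recording drivers) is a hypothetical stand-in, not an actual Aurora or Mesos class. It also assumes executor ID == task ID, as Bill proposed, and it does not cover tasks that run without an executor.

```java
import java.util.ArrayList;
import java.util.List;

public class DestroySketch {

  /** Hypothetical stand-in for a driver that can only kill tasks. */
  interface Driver {
    void killTask(String taskId);
  }

  /** Hypothetical stand-in for a driver that can also shut down executors. */
  interface ShutdownCapableDriver extends Driver {
    void shutdownExecutor(String executorId, String agentId);
  }

  /** Uses SHUTDOWN when the driver supports it, otherwise falls back to KILL. */
  static void destroy(Driver driver, String taskId, String executorId, String agentId) {
    if (driver instanceof ShutdownCapableDriver) {
      ((ShutdownCapableDriver) driver).shutdownExecutor(executorId, agentId);
    } else {
      driver.killTask(taskId);
    }
  }

  /** Test double that records KILL calls instead of talking to Mesos. */
  static class RecordingKillDriver implements Driver {
    final List<String> calls = new ArrayList<>();

    @Override
    public void killTask(String taskId) {
      calls.add("KILL " + taskId);
    }
  }

  /** Test double that additionally records SHUTDOWN calls. */
  static class RecordingShutdownDriver extends RecordingKillDriver
      implements ShutdownCapableDriver {
    @Override
    public void shutdownExecutor(String executorId, String agentId) {
      calls.add("SHUTDOWN " + executorId + " on " + agentId);
    }
  }

  public static void main(String[] args) {
    RecordingKillDriver legacy = new RecordingKillDriver();
    RecordingShutdownDriver versioned = new RecordingShutdownDriver();

    // Assumption: the executor ID is set to the task ID, so both are "task-1".
    destroy(legacy, "task-1", "task-1", "agent-7");
    destroy(versioned, "task-1", "task-1", "agent-7");

    System.out.println(legacy.calls);
    System.out.println(versioned.calls);
  }
}
```

Since SHUTDOWN was described as best effort with no executor-state reconciliation, a caller of something like `destroy` would presumably still rely on task state reconciliation to decide whether to retry, as raised in Q2.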