Looks like a good step forward. What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration?
Are there time-bound guarantees on how soon a task update will be sent out after a reconcile request is sent? In the task reconciliation algorithm, what would be a good timeout after which we can conclude that no task update is coming from the master? Upon such a timeout, I would be tempted to conclude that the task has disappeared. In that case, I would call driver.killTask() (to be sure it's marked as gone), mark my task as terminated, and then submit a replacement task.

Does the "rate limiting" feature (in the works?) affect task reconciliation, given the volume of task updates sent back?

Thanks.

On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler <[email protected]> wrote:
> Hi all,
>
> I've sent a review out for a document describing reconciliation, you can
> see the draft here:
> https://gist.github.com/bmahler/18409fc4f052df43f403
>
> Would love to gather high level feedback on it from framework developers.
> Feel free to reply here, or on the review:
> https://reviews.apache.org/r/26669/
>
> Thanks!
> Ben
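To make the timeout handling I describe above concrete, here is a rough Python sketch of what I have in mind. It is not taken from the reconciliation document, and only driver.killTask() is a name from my message. The TaskTracker class, the resubmit hook, the Driver stand-in, and the 30-second value in the usage note are all assumptions for illustration; the right timeout is exactly what I am asking about.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class Driver(Protocol):
    """Stand-in for the scheduler driver; only killTask() is assumed here."""

    def killTask(self, task_id: str) -> None: ...


@dataclass
class TaskTracker:
    """Hypothetical bookkeeping for tasks awaiting a reconciliation update."""

    driver: Driver
    timeout_secs: float                 # assumed; a good value is the open question
    resubmit: Callable[[str], None]     # hypothetical hook that queues a replacement
    clock: Callable[[], float] = time.monotonic
    # task_id -> time at which the reconcile request was sent
    pending: Dict[str, float] = field(default_factory=dict)
    terminated: List[str] = field(default_factory=list)

    def reconcile_requested(self, task_id: str) -> None:
        """Record that a reconcile request was just sent for this task."""
        self.pending[task_id] = self.clock()

    def status_update(self, task_id: str) -> None:
        """An update arrived, so the task is no longer waiting."""
        self.pending.pop(task_id, None)

    def expire(self) -> List[str]:
        """Treat tasks with no update within the timeout as gone."""
        now = self.clock()
        expired = [t for t, sent in self.pending.items()
                   if now - sent >= self.timeout_secs]
        for task_id in expired:
            self.driver.killTask(task_id)     # be sure it is marked as gone
            self.terminated.append(task_id)   # mark the task as terminated locally
            del self.pending[task_id]
            self.resubmit(task_id)            # submit a replacement task
        return expired
```

The framework would call reconcile_requested() when it sends the request, status_update() from its status update callback, and expire() periodically, for example with timeout_secs=30.0. Whether killing and replacing on timeout is safe at all depends on the answer to the guarantee question above.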

