Trying to get task reconciliation to work

2014-04-17 Thread Sharma Podila
Hello,

I don't seem to have reconcileTasks() working for me and was wondering if I
am either using it incorrectly or hitting a problem. Here's what's
happening:

1. There's one Mesos (0.18) master, one slave, one framework, all running
on Ubuntu 12.04
2. Mesos master and slave come up fine (using Zookeeper, but that isn't
relevant here, I'd think)
3. My framework registers and gets offers
4. Two tasks are launched, both start running fine on the single available
slave
5. I restart my framework. During restart my framework knows that it had
previously launched two tasks that were last known to be in running state.
Therefore, upon getting the registered() callback, it calls
driver.reconcileTasks() for the two tasks. In actuality, the tasks are
still running fine. I see this in mesos master logs:

I0417 12:26:27.207361 27301 master.cpp:2154] Performing task state reconciliation for framework MyFramework

But, no other logs about reconciliation.

6. My framework gets no callback about status of tasks that it requested
reconciliation on.

At this point, I am not sure whether the lack of a status update callback
is due to
  a) my framework asking for reconciliation on the running state, which
Mesos also knows to be true, so no status update is sent, or
  b) the reconcile not working at all. (Hopefully this; reason (a) would be
problematic.)

So, I then proceed to another test:

7. kill my framework and mesos master
8. Then, kill the slave (as an aside, this seems to have killed the tasks
as well)
9. Restart mesos master
10. Restart my framework. Now, again the reconciliation is requested.
11. Still no callback.

At this point, the Mesos master doesn't know about the slave, because the
slave hasn't come back since the master restarted.
What is the expected behavior for reconciliation under these circumstances?

12. Restarted slave
13. Killed and restarted my framework.
14. Still no callback for reconciliation.

Given these results, I can't see how reconciliation is working at all. I
first tried this with Mesos 0.16 and then upgraded to 0.18 to see if it
made a difference.

Thank you for any ideas on getting this resolved.

Sharma


Re: Trying to get task reconciliation to work

2014-04-17 Thread Sharma Podila
Should've looked at the code before sending the previous email...
master/main.cpp confirmed what I needed to know. It doesn't look like I
will be able to use reconcileTasks the way I thought I could. Effectively,
a lack of callback could mean either that the master agrees with the
requested reconcile task state, or that the task and/or slave is currently
unknown, which makes it an unreliable source of data. I understand this is
expected to improve later by leveraging the registrar, but I suspect
there's more to it.
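
A toy Python model of that behavior may make the ambiguity easier to see. This is only a sketch of the semantics described above, not Mesos code: the `reconcile` function, the `TaskState` enum, the `master_view` dictionary and the task IDs are all made up for illustration.

```python
from enum import Enum
from typing import Dict, Optional


class TaskState(Enum):
    # Hypothetical stand-in for the real Mesos task states.
    RUNNING = "TASK_RUNNING"
    FINISHED = "TASK_FINISHED"
    LOST = "TASK_LOST"


def reconcile(master_view: Dict[str, TaskState],
              task_id: str,
              claimed: TaskState) -> Optional[TaskState]:
    """Return the update a toy master would send, or None for silence."""
    known = master_view.get(task_id)
    if known is None:
        # Task (or its slave) is unknown to the master: nothing is sent.
        return None
    if known == claimed:
        # Master agrees with the framework: nothing is sent either.
        return None
    # Only a disagreement produces a status update.
    return known


# Assumed master state: it knows two tasks and has never heard of "task-3".
master_view = {"task-1": TaskState.RUNNING, "task-2": TaskState.FINISHED}

agree = reconcile(master_view, "task-1", TaskState.RUNNING)
unknown = reconcile(master_view, "task-3", TaskState.RUNNING)
differ = reconcile(master_view, "task-2", TaskState.RUNNING)
```

In this model `agree` and `unknown` are both `None`, so the framework cannot tell "master agrees" apart from "master has no idea"; only `differ` carries information.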

I take it then that individual frameworks need to have their own mechanisms
to ascertain the state of their tasks.
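
One possible shape for such a mechanism, sketched in Python under stated assumptions: the framework records when it asked about each task and, after a timeout, treats any task with no status update as unconfirmed and checks it by other means. The `ReconciliationTracker` class, its method names and the 30 second timeout are hypothetical and not part of any Mesos API.

```python
import time
from typing import Dict, List, Optional


class ReconciliationTracker:
    """Framework-side bookkeeping for reconciliation requests (illustrative)."""

    def __init__(self, timeout_secs: float) -> None:
        self.timeout_secs = timeout_secs
        # task_id -> time at which reconciliation was requested
        self._pending: Dict[str, float] = {}

    def requested(self, task_id: str, now: Optional[float] = None) -> None:
        # Call this right after asking the master to reconcile the task.
        self._pending[task_id] = time.monotonic() if now is None else now

    def status_update(self, task_id: str) -> None:
        # Call this from the status update callback; the task is now confirmed.
        self._pending.pop(task_id, None)

    def unconfirmed(self, now: Optional[float] = None) -> List[str]:
        # Tasks whose request is older than the timeout and still unanswered.
        now = time.monotonic() if now is None else now
        return sorted(task_id for task_id, asked_at in self._pending.items()
                      if now - asked_at >= self.timeout_secs)


# Explicit timestamps are passed here so the example is deterministic.
tracker = ReconciliationTracker(timeout_secs=30.0)
tracker.requested("task-1", now=0.0)
tracker.requested("task-2", now=0.0)
tracker.status_update("task-2")      # master answered for task-2 only
stale = tracker.unconfirmed(now=31.0)
```

Here `stale` contains only `task-1`. What the framework does with an unconfirmed task (ask again, query the executor directly, or relaunch) is a design choice this sketch leaves open.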

