Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-20 Thread Alexander Rukletsov


> On June 14, 2015, 10:46 a.m., Benjamin Hindman wrote:
> > Just so I understand, does this mean if we happen to get in the unfortunate 
> > situation where a slave has neglected to get the dynamic reservation 
> > because it was just starting up and then it gets the task launch it will 
> > shutdown the slave because the CHECK will fail? I would expect the slave to 
> > simply send a TASK_LOST. Said another way, this is not an assertion our 
> > code guarantees. If instead we were waiting for some kind of an ack from 
> > the slave that it received the dynamic reservation before it send the task 
> > launch then a CHECK would make sense.
> 
> Jie Yu wrote:
> We don't expect this to happen because we always send a 
> CheckpointResourcesMessage before sending the task to the slave and TCP 
> ensures in order delivery (out of order delivery is possible if two sockets 
> are used. it's possible because the way we create ephemeral connections, but 
> this is very unlikely to happen). Master won't send the task to the slave if 
> the slave hasn't registered.
> 
> I would rather keep the CHECK here unless we found that this is a real 
> issue (and then we can change that to send status update).
> 
> Michael Park wrote:
> So currently it is possible for this to happen, but only with a very 
> small probability. Your proposal is to keep the `CHECK` and put in the effort 
> to eliminate the possibility once we observe it as a real problem, correct? 
> The part that I don't quite understand is, what's the motivation to wait for 
> a real problem to occur when we know it's possible to run into this issue 
> (even with a small probability), the effort to change the `CHECK` to sending 
> `TASK_LOST` seems to be small?
> 
> Jie Yu wrote:
> Well, everything has a probablity to fail, the question is how large the 
> probability is. Memory could have hardware errors and a bit could be flipped 
> due to random reasons, does that mean that we have to do parity check in 
> every single location in our code base? I think my point is the probability 
> for this to fail is extremely low so that we shouldn't worry too much.
> 
> I am fine with sending a status update.
> 
> Alexander Rukletsov wrote:
> I wonder, what are the cases when the task launch request may arrive 
> before `CheckpointResourcesMessage`? If my understanding is correct, we do 
> not have delivery guarantee for `CheckpointResourcesMessage`, nor we have the 
> same queue in Master for `CheckpointResourcesMessage` and `RunTaskMessage` to 
> ensure the order. My intuition is that the probability of such an event is 
> not negligible: a network blip can occur and `CheckpointResourcesMessage` may 
> be lost or delayed, we can open another socket to the slave for 
> `RunTaskMessage`. Could you please help me understand that?
> 
> Ben Mahler wrote:
> Such a situation will manifest an `Exited` event from the socket closure. 
> At the application level, we want to ensure that if there are any `Exited` 
> events, the slave (or framework) will re-register. This is currently not 
> fully implemented: currently only the master-side `Exited` is implemented (we 
> ping the slave telling it we think it is disconnected), the slave-side 
> `Exited` is a no-op.
> 
> It may become simpler with the HTTP API since we have a single duplex 
> socket (the master does not initiate a connection with the slave (or 
> framework)). This means that the responsibility of dealing with a closed 
> socket is left to the slave (or framework) only. Off the top of my head, I'm 
> not sure if there are situations where only 1 side of the socket can be 
> broken.. so maybe it will be just as complicated :)
> 
> Let's discuss off this thread, I do have some tickets around this stuff.

Sure, would love to!


- Alexander


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review87857
---


On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
> https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 

Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Michael Park


> On June 19, 2015, 10:57 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 1579-1583
> > 
> >
> > This should be a CHECK instead?

I've created [r35686](https://reviews.apache.org/r/35686/) as a follow-up for 
this.


- Michael


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review88595
---


On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
> https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review88595
---



src/slave/slave.cpp (lines 1579 - 1583)


This should be a CHECK instead?


- Jie Yu


On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
> https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Ben Mahler


> On June 14, 2015, 10:46 a.m., Benjamin Hindman wrote:
> > Just so I understand, does this mean if we happen to get in the unfortunate 
> > situation where a slave has neglected to get the dynamic reservation 
> > because it was just starting up and then it gets the task launch it will 
> > shutdown the slave because the CHECK will fail? I would expect the slave to 
> > simply send a TASK_LOST. Said another way, this is not an assertion our 
> > code guarantees. If instead we were waiting for some kind of an ack from 
> > the slave that it received the dynamic reservation before it send the task 
> > launch then a CHECK would make sense.
> 
> Jie Yu wrote:
> We don't expect this to happen because we always send a 
> CheckpointResourcesMessage before sending the task to the slave and TCP 
> ensures in order delivery (out of order delivery is possible if two sockets 
> are used. it's possible because the way we create ephemeral connections, but 
> this is very unlikely to happen). Master won't send the task to the slave if 
> the slave hasn't registered.
> 
> I would rather keep the CHECK here unless we found that this is a real 
> issue (and then we can change that to send status update).
> 
> Michael Park wrote:
> So currently it is possible for this to happen, but only with a very 
> small probability. Your proposal is to keep the `CHECK` and put in the effort 
> to eliminate the possibility once we observe it as a real problem, correct? 
> The part that I don't quite understand is, what's the motivation to wait for 
> a real problem to occur when we know it's possible to run into this issue 
> (even with a small probability), the effort to change the `CHECK` to sending 
> `TASK_LOST` seems to be small?
> 
> Jie Yu wrote:
> Well, everything has a probablity to fail, the question is how large the 
> probability is. Memory could have hardware errors and a bit could be flipped 
> due to random reasons, does that mean that we have to do parity check in 
> every single location in our code base? I think my point is the probability 
> for this to fail is extremely low so that we shouldn't worry too much.
> 
> I am fine with sending a status update.
> 
> Alexander Rukletsov wrote:
> I wonder, what are the cases when the task launch request may arrive 
> before `CheckpointResourcesMessage`? If my understanding is correct, we do 
> not have delivery guarantee for `CheckpointResourcesMessage`, nor we have the 
> same queue in Master for `CheckpointResourcesMessage` and `RunTaskMessage` to 
> ensure the order. My intuition is that the probability of such an event is 
> not negligible: a network blip can occur and `CheckpointResourcesMessage` may 
> be lost or delayed, we can open another socket to the slave for 
> `RunTaskMessage`. Could you please help me understand that?

Such a situation will manifest an `Exited` event from the socket closure. At 
the application level, we want to ensure that if there are any `Exited` events, 
the slave (or framework) will re-register. This is currently not fully 
implemented: currently only the master-side `Exited` is implemented (we ping 
the slave telling it we think it is disconnected), the slave-side `Exited` is a 
no-op.

It may become simpler with the HTTP API since we have a single duplex socket 
(the master does not initiate a connection with the slave (or framework)). This 
means that the responsibility of dealing with a closed socket is left to the 
slave (or framework) only. Off the top of my head, I'm not sure if there are 
situations where only 1 side of the socket can be broken.. so maybe it will be 
just as complicated :)

Let's discuss off this thread, I do have some tickets around this stuff.


- Ben


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review87857
---


On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
> https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Benjamin Hindman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review88568
---

Ship it!


Ship It!

- Benjamin Hindman


On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
> https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Alexander Rukletsov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review88541
---

Ship it!


Ship It!

- Alexander Rukletsov


On June 19, 2015, 2:31 p.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 2:31 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Bugs: MESOS-2491
> https://issues.apache.org/jira/browse/MESOS-2491
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Michael Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/
---

(Updated June 19, 2015, 2:31 p.m.)


Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.


Changes
---

Added `TaskStatus::REASON_RESOURCES_UNKNOWN`
Added comments around why we need:
```cpp
if (framework->executors.empty() && framework->pending.empty()) {
  removeFramework(framework);
}
```
before we exit `_runTask`.


Bugs: MESOS-2491
https://issues.apache.org/jira/browse/MESOS-2491


Repository: mesos


Description
---

No bug was observed (yet), but realized I forgot about this in the dynamic 
reservations patches.


Diffs (updated)
-

  include/mesos/mesos.proto 8df1211165169c9595e0e6e85b5ddc404345ff70 
  src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 

Diff: https://reviews.apache.org/r/35433/diff/


Testing
---

`make check`


Thanks,

Michael Park



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Michael Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/
---

(Updated June 19, 2015, 12:29 p.m.)


Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.


Changes
---

Link [MESOS-2491](https://issues.apache.org/jira/browse/MESOS-2491)


Bugs: MESOS-2491
https://issues.apache.org/jira/browse/MESOS-2491


Repository: mesos


Description
---

No bug was observed (yet), but realized I forgot about this in the dynamic 
reservations patches.


Diffs
-

  src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 

Diff: https://reviews.apache.org/r/35433/diff/


Testing
---

`make check`


Thanks,

Michael Park



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Alexander Rukletsov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review88513
---

Ship it!


Want to link the RR with MESOS-2491 for posterity?


src/slave/slave.cpp (line 1399)


I believe you've chosen `TASK_LOST` because the appropriate 
`CheckpointResourcesMessage` is about to arrive and restarting the task may 
succeed. If this is the case, let's expand the comment.



src/slave/slave.cpp (lines 1409 - 1416)


Let's add `TaskStatus::Reason` for that! How about 
`REASON_RESOURCES_UNKNOWN`? I'm ok with doing it in a separate RR in order not 
to block this patch.



src/slave/slave.cpp (lines 1420 - 1422)


I've seen your discussion about these lines and that you plan to follow-up 
on this. In the meantime, mind throwing a comment, why do we need to remove the 
framework here? (I believe it is because we could have created it one step 
before in `runTask()`, right?)



src/slave/slave.cpp (line 1446)


Ditto.


- Alexander Rukletsov


On June 19, 2015, 12:42 a.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 12:42 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-19 Thread Alexander Rukletsov


> On June 14, 2015, 10:46 a.m., Benjamin Hindman wrote:
> > Just so I understand, does this mean if we happen to get in the unfortunate 
> > situation where a slave has neglected to get the dynamic reservation 
> > because it was just starting up and then it gets the task launch it will 
> > shutdown the slave because the CHECK will fail? I would expect the slave to 
> > simply send a TASK_LOST. Said another way, this is not an assertion our 
> > code guarantees. If instead we were waiting for some kind of an ack from 
> > the slave that it received the dynamic reservation before it send the task 
> > launch then a CHECK would make sense.
> 
> Jie Yu wrote:
> We don't expect this to happen because we always send a 
> CheckpointResourcesMessage before sending the task to the slave and TCP 
> ensures in order delivery (out of order delivery is possible if two sockets 
> are used. it's possible because the way we create ephemeral connections, but 
> this is very unlikely to happen). Master won't send the task to the slave if 
> the slave hasn't registered.
> 
> I would rather keep the CHECK here unless we found that this is a real 
> issue (and then we can change that to send status update).
> 
> Michael Park wrote:
> So currently it is possible for this to happen, but only with a very 
> small probability. Your proposal is to keep the `CHECK` and put in the effort 
> to eliminate the possibility once we observe it as a real problem, correct? 
> The part that I don't quite understand is, what's the motivation to wait for 
> a real problem to occur when we know it's possible to run into this issue 
> (even with a small probability), the effort to change the `CHECK` to sending 
> `TASK_LOST` seems to be small?
> 
> Jie Yu wrote:
> Well, everything has a probablity to fail, the question is how large the 
> probability is. Memory could have hardware errors and a bit could be flipped 
> due to random reasons, does that mean that we have to do parity check in 
> every single location in our code base? I think my point is the probability 
> for this to fail is extremely low so that we shouldn't worry too much.
> 
> I am fine with sending a status update.

I wonder, what are the cases when the task launch request may arrive before 
`CheckpointResourcesMessage`? If my understanding is correct, we do not have 
delivery guarantee for `CheckpointResourcesMessage`, nor we have the same queue 
in Master for `CheckpointResourcesMessage` and `RunTaskMessage` to ensure the 
order. My intuition is that the probability of such an event is not negligible: 
a network blip can occur and `CheckpointResourcesMessage` may be lost or 
delayed, we can open another socket to the slave for `RunTaskMessage`. Could 
you please help me understand that?


- Alexander


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review87857
---


On June 19, 2015, 12:42 a.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 12:42 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-18 Thread Michael Park


> On June 18, 2015, 5:27 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, line 1409
> > 
> >
> > Have we decided to not capture temp variable by const ref? Maybe you 
> > want to do a sweep to fix that in this file?

I fixed it in this review to begin with, but decided that it would be easier 
for folks to review if I split it out: https://reviews.apache.org/r/35638


- Michael


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/#review88406
---


On June 19, 2015, 12:42 a.m., Michael Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/35433/
> ---
> 
> (Updated June 19, 2015, 12:42 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> No bug was observed (yet), but realized I forgot about this in the dynamic 
> reservations patches.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 
> 
> Diff: https://reviews.apache.org/r/35433/diff/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Michael Park
> 
>



Re: Review Request 35433: Sent StatusUpdates if checkpointed resources don't exist on the slave.

2015-06-18 Thread Michael Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35433/
---

(Updated June 19, 2015, 12:42 a.m.)


Review request for mesos, Alexander Rukletsov, Benjamin Hindman, and Jie Yu.


Summary (updated)
-

Sent StatusUpdates if checkpointed resources don't exist on the slave.


Repository: mesos


Description
---

No bug was observed (yet), but realized I forgot about this in the dynamic 
reservations patches.


Diffs
-

  src/slave/slave.cpp a5ad29f59fadba919ed82ba2892c2febe551660b 

Diff: https://reviews.apache.org/r/35433/diff/


Testing
---

`make check`


Thanks,

Michael Park