Re: Review Request 34720: Added kill executor correction to slave.

2015-06-16 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review88124
---

Ship it!


LGTM. Some nits on logging.


src/slave/slave.cpp (line 4137)
https://reviews.apache.org/r/34720/#comment140520

I think we already have
```
using mesos::slave::QoSController;
```

So you can remove the namespace prefix. It probably will fit in one line.



src/slave/slave.cpp (line 4161)
https://reviews.apache.org/r/34720/#comment140522

Ditto on removing mesos::slave:: prefix. Please do a sweep.



src/slave/slave.cpp (line 4172)
https://reviews.apache.org/r/34720/#comment140526

s/kill/KILL/

Here and everywhere else. Also, to be consistent, let's use the same 
logging format:

```
Ignoring QoS KILL correction: framework id not specified.
```



src/slave/slave.cpp (lines 4182 - 4183)
https://reviews.apache.org/r/34720/#comment140533

```
Ignoring QoS KILL correction on framework ...: executor id not specified.
```



src/slave/slave.cpp (line 4191)
https://reviews.apache.org/r/34720/#comment140527

```Ignoring QoS KILL correction on framework ...: framework cannot be 
found```



src/slave/slave.cpp (line 4202)
https://reviews.apache.org/r/34720/#comment140528

Ignoring QoS KILL correction on framework ...: framework is terminating.



src/slave/slave.cpp (line 4209)
https://reviews.apache.org/r/34720/#comment140525

Shouldn't be 'executorId' here? Also, can you put single quotes for 
executorId since it's specifed by the user (i.e., might contain spaces).



src/slave/slave.cpp (lines 4209 - 4211)
https://reviews.apache.org/r/34720/#comment140529

To be consistent, here should be:
```
Ignoring QoS KILL correction on executor ... of framework ...: executor 
cannot be found
```



src/slave/slave.cpp (lines 4218 - 4220)
https://reviews.apache.org/r/34720/#comment140530

HOw about
```
Applying QoS KILL correction on executor ... of framework ...
```



src/slave/slave.cpp (line 4219)
https://reviews.apache.org/r/34720/#comment140524

YOu forgot the single quote here.



src/slave/slave.cpp (lines 4236 - 4238)
https://reviews.apache.org/r/34720/#comment140531

Ditto on consistency:
```
Ignoring QoS KILL correction on executor ... of framework ...: executor is 
XXX
```



src/slave/slave.cpp (line 4247)
https://reviews.apache.org/r/34720/#comment140532

QoS correction type ... is not supported.


- Jie Yu


On June 16, 2015, 8:42 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 16, 2015, 8:42 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/flags.hpp 6c24e56d15881b0e3aeec3d4824842cf57121fc6 
   src/slave/flags.cpp 93690cfa9dbf2658ce642829299f4adf08bb1062 
   src/slave/slave.hpp dbed46d76cc7fbdd1a8d3ebcc2a1ff08b75da10f 
   src/slave/slave.cpp 361433063b5cb38d8326f8247cf4fc6f4a18e5c9 
   src/tests/mesos.hpp ecdf9109d2e46e8730754eeeb4978863679d56e7 
   src/tests/mesos.cpp 509f9f205fdb1fa094e313b6f0da53000ffecbb3 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-16 Thread Jie Yu


 On June 16, 2015, 9:22 p.m., Jie Yu wrote:
  LGTM. Some nits on logging.

IN fact, can you consistently use `QoS correction KILL` (instead of QoS KILL 
correction) in the logging?


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review88124
---


On June 16, 2015, 8:42 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 16, 2015, 8:42 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/flags.hpp 6c24e56d15881b0e3aeec3d4824842cf57121fc6 
   src/slave/flags.cpp 93690cfa9dbf2658ce642829299f4adf08bb1062 
   src/slave/slave.hpp dbed46d76cc7fbdd1a8d3ebcc2a1ff08b75da10f 
   src/slave/slave.cpp 361433063b5cb38d8326f8247cf4fc6f4a18e5c9 
   src/tests/mesos.hpp ecdf9109d2e46e8730754eeeb4978863679d56e7 
   src/tests/mesos.cpp 509f9f205fdb1fa094e313b6f0da53000ffecbb3 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-16 Thread Niklas Nielsen


 On June 12, 2015, 3:49 p.m., Jie Yu wrote:
  src/slave/slave.cpp, lines 4203-4210
  https://reviews.apache.org/r/34720/diff/5/?file=983931#file983931line4203
 
  Can you add a TODO here saying that we may want to  check if 
  containerId matches or not due to race (i.e., qos controller wants to kill 
  container 1 of executor 1, but the executor you are checking here is for 
  container 2 of executor 1).
  
  That means we need to add ContainerID in ResourceUsage/QoSCorrection as 
  well

Added MESOS-2875 :)


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review87767
---


On June 16, 2015, 1:42 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 16, 2015, 1:42 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/flags.hpp 6c24e56d15881b0e3aeec3d4824842cf57121fc6 
   src/slave/flags.cpp 93690cfa9dbf2658ce642829299f4adf08bb1062 
   src/slave/slave.hpp dbed46d76cc7fbdd1a8d3ebcc2a1ff08b75da10f 
   src/slave/slave.cpp 361433063b5cb38d8326f8247cf4fc6f4a18e5c9 
   src/tests/mesos.hpp ecdf9109d2e46e8730754eeeb4978863679d56e7 
   src/tests/mesos.cpp 509f9f205fdb1fa094e313b6f0da53000ffecbb3 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-16 Thread Niklas Nielsen


 On June 16, 2015, 2:22 p.m., Jie Yu wrote:
  src/slave/slave.cpp, line 4210
  https://reviews.apache.org/r/34720/diff/6/?file=985948#file985948line4210
 
  Shouldn't be 'executorId' here? Also, can you put single quotes for 
  executorId since it's specifed by the user (i.e., might contain spaces).

Great point! Thanks!


 On June 16, 2015, 2:22 p.m., Jie Yu wrote:
  src/slave/slave.cpp, line 4138
  https://reviews.apache.org/r/34720/diff/6/?file=985948#file985948line4138
 
  I think we already have
  ```
  using mesos::slave::QoSController;
  ```
  
  So you can remove the namespace prefix. It probably will fit in one 
  line.

Added 'using mesos::slave::QoSCorrection;'


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review88124
---


On June 16, 2015, 4:47 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 16, 2015, 4:47 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/flags.hpp 6c24e56d15881b0e3aeec3d4824842cf57121fc6 
   src/slave/flags.cpp 93690cfa9dbf2658ce642829299f4adf08bb1062 
   src/slave/slave.hpp dbed46d76cc7fbdd1a8d3ebcc2a1ff08b75da10f 
   src/slave/slave.cpp 361433063b5cb38d8326f8247cf4fc6f4a18e5c9 
   src/tests/mesos.hpp ecdf9109d2e46e8730754eeeb4978863679d56e7 
   src/tests/mesos.cpp 509f9f205fdb1fa094e313b6f0da53000ffecbb3 
   src/tests/oversubscription_tests.cpp 
 3481ad2eef43c3860642970b4c96494997de8552 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-12 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/
---

(Updated June 12, 2015, 2:48 p.m.)


Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
Vinod Kone.


Bugs: MESOS-2653
https://issues.apache.org/jira/browse/MESOS-2653


Repository: mesos


Description
---

See summary


Diffs (updated)
-

  src/slave/slave.hpp 0df1b55791963fb4159b7ea5318d09dde4f7d8c7 
  src/slave/slave.cpp 9af69d8f0b28c9441c684886c52320378f9b2869 

Diff: https://reviews.apache.org/r/34720/diff/


Testing
---

make check


Thanks,

Niklas Nielsen



Re: Review Request 34720: Added kill executor correction to slave.

2015-06-12 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review87767
---



src/slave/slave.hpp
https://reviews.apache.org/r/34720/#comment140194

The 'reason' is for the slave to encode the reason  behind a terminal 
status update for those pending/unterminated tasks when executor is terminated.



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140186

Do you want to introduce a configurable minimal interval between two 
'corrections()' pull? (like we did for resource estimator)?

You probably want to follow the implementation of `fowardOversubscribed` 
and `_forwardOversubscribed`.



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140192

 because the framework cannot be found



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140189

indent



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140191

is not running is confusing. How about:

because the executor cannot be found



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140193

Can you add a TODO here saying that we may want to  check if containerId 
matches or not due to race (i.e., qos controller wants to kill container 1 of 
executor 1, but the executor you are checking here is for container 2 of 
executor 1).

That means we need to add ContainerID in ResourceUsage/QoSCorrection as well



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140190

because the executor is   executor-state



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140195

overrides `executor-reason`



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment140196

infer ... from 'killed' field in 'termination'.


- Jie Yu


On June 12, 2015, 9:48 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 12, 2015, 9:48 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 0df1b55791963fb4159b7ea5318d09dde4f7d8c7 
   src/slave/slave.cpp 9af69d8f0b28c9441c684886c52320378f9b2869 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-12 Thread Jie Yu


 On June 12, 2015, 6:43 p.m., Jie Yu wrote:
  src/slave/slave.cpp, lines 4383-4400
  https://reviews.apache.org/r/34720/diff/4/?file=983235#file983235line4383
 
  First of all, we don't need to check if 'executor' is NULL or not here. 
  If you looked at the callers, it's guaranteed that 'executor' will be 
  non-NULL. So you need a CHECK here.
  
  Also, I thought about what the correct logic should be in this funtion 
  (see below). Let me know your thoughts. (we have too many tech debts in 
  this function, so we better fix this asap, instead of waiting further).
  
  ```
  TaskState state;
  TaskStatus::Reason reason;
  string message;
  
  // Determine the state and reason for the status update.
  if (termination.isReady()  termination.get().has_reason()) {
// This is the case where the resource limit has been
// reached and the container destroyed itself.
state = TASK_FAILED;
reason = termination.get().reason();
  } else if (executor-reason.isSome()) {
// This is the case where the slave initiates the
// termination of the container.

// NOTE: For command executors, previously we send TASK_FAILED
// and REASON_COMMAND_EXECUTOR_FAILED when the command executor
// failed to launch (e.g., invalid executor path).
if (executor-isCommandExecutor() 
executor-reason.get() == REASON_EXECUTOR_REGISTRATION_TIMEOUT) {
  state = TASK_FAILED;
  reason = REASON_COMMAND_EXECUTOR_FAILED;
} else {
  state = TASK_LOST;
  reason = executor-reason.get();
}
  } else {
state = TASK_LOST;
reason = TaskStatus::REASON_EXECUTOR_TERMINATED
  }
  
  // Determine the message for the status update.
  // TODO(nnielsen): Consider adding a message from the slave
  // as well, and concatenating the messages.
  if (termination.isReady()) {
message = termination.get().message();
  } else {
// This is case where the containerizer fails
// to destroy the container.
message = Abnormal executor termination; 
  }
  
  statusUpdate(...);
  ```
 
 Jie Yu wrote:
 Forgot to mention that you need to set executor-reason accordingly when 
 slave initiates a destroy of the container (e.g., containerizer-update 
 fails, executor registration timeout, etc.).
 
 Vinod Kone wrote:
 Changing the command task state from TASK_FAILED to TASK_LOST is an API 
 change! It should've a deprecation cycle with a headsup given on the dev 
 list. I would recommend doing:
 
 ```
 // Determine the state and reason for the status update.
 if (termination.isReady()  termination.get().has_reason()) {
   // This is the case where the resource limit has been
   // reached and the container destroyed itself.
   state = TASK_FAILED;
   reason = termination.get().reason();
 } else  if (executor-isCommandExecutor()) {
 // TODO(jieyu): .
 state = TASK_FAILED;
 reason = REASON_COMMAND_EXECUTOR_FAILED;
 } else if (executor-reason.isSome()) {
   // This is the case where the slave initiates the
   // termination of the container.
 state = TASK_LOST;
 reason = executor-reason.get();
 } else {
   state = TASK_LOST;
   reason = TaskStatus::REASON_EXECUTOR_TERMINATED
 }
 
 ```

OK, that's why exactly the reason why I want this to be fixed this asap 
(otherwise, we'll have to deal with another API breaking change).

Vinod, your code does not work for niq. For example, if a command executor is 
killed due to QOS violation, we will send TASK_FAILED and 
REASON_COMMAND_EXECUTOR_FAILED. That's not correct.


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review87734
---


On June 11, 2015, 10:58 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 11, 2015, 10:58 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 0df1b55791963fb4159b7ea5318d09dde4f7d8c7 
   src/slave/slave.cpp b523c2fce50e56f4f94d55a9488f49c53452e4d4 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-11 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/
---

(Updated June 11, 2015, 3:58 p.m.)


Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
Vinod Kone.


Bugs: MESOS-2653
https://issues.apache.org/jira/browse/MESOS-2653


Repository: mesos


Description
---

See summary


Diffs (updated)
-

  src/slave/slave.hpp 0df1b55791963fb4159b7ea5318d09dde4f7d8c7 
  src/slave/slave.cpp b523c2fce50e56f4f94d55a9488f49c53452e4d4 

Diff: https://reviews.apache.org/r/34720/diff/


Testing
---

make check


Thanks,

Niklas Nielsen



Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Jie Yu


 On June 4, 2015, 5:58 p.m., Vinod Kone wrote:
  src/slave/slave.cpp, lines 4400-4413
  https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400
 
  This is getting a little hairy. As the TODO says we really ought bubble 
  this up via the Termination protobuf. Have you looked into it?

+1

It really gets nasty here. I remembered that jaybuff said that they plan to fix 
that. Maybe you want to sync with them first.


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
---


On June 4, 2015, 5:43 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 4, 2015, 5:43 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
---



src/slave/slave.cpp
https://reviews.apache.org/r/34720/#comment138734

This is getting a little hairy. As the TODO says we really ought bubble 
this up via the Termination protobuf. Have you looked into it?


- Vinod Kone


On June 4, 2015, 5:43 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 4, 2015, 5:43 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Jie Yu


 On June 4, 2015, 5:58 p.m., Vinod Kone wrote:
  src/slave/slave.cpp, lines 4400-4413
  https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400
 
  This is getting a little hairy. As the TODO says we really ought bubble 
  this up via the Termination protobuf. Have you looked into it?
 
 Jie Yu wrote:
 +1
 
 It really gets nasty here. I remembered that jaybuff said that they plan 
 to fix that. Maybe you want to sync with them first.
 
 Niklas Nielsen wrote:
 Mind adding a few references? Where is the Termination proto?

Some references for the context:
https://issues.apache.org/jira/browse/MESOS-2035
https://reviews.apache.org/r/33249/ (see discussions in the review)
https://issues.apache.org/jira/browse/MESOS-2657


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
---


On June 4, 2015, 5:43 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 4, 2015, 5:43 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Niklas Nielsen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/
---

(Updated June 4, 2015, 10:43 a.m.)


Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
Vinod Kone.


Bugs: MESOS-2653
https://issues.apache.org/jira/browse/MESOS-2653


Repository: mesos


Description
---

See summary


Diffs (updated)
-

  src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
  src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 

Diff: https://reviews.apache.org/r/34720/diff/


Testing
---

make check


Thanks,

Niklas Nielsen



Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Niklas Nielsen


 On June 4, 2015, 10:58 a.m., Vinod Kone wrote:
  src/slave/slave.cpp, lines 4400-4413
  https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400
 
  This is getting a little hairy. As the TODO says we really ought bubble 
  this up via the Termination protobuf. Have you looked into it?
 
 Jie Yu wrote:
 +1
 
 It really gets nasty here. I remembered that jaybuff said that they plan 
 to fix that. Maybe you want to sync with them first.

Mind adding a few references? Where is the Termination proto?


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
---


On June 4, 2015, 10:43 a.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 4, 2015, 10:43 a.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Jie Yu


 On June 4, 2015, 5:58 p.m., Vinod Kone wrote:
  src/slave/slave.cpp, lines 4400-4413
  https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400
 
  This is getting a little hairy. As the TODO says we really ought bubble 
  this up via the Termination protobuf. Have you looked into it?
 
 Jie Yu wrote:
 +1
 
 It really gets nasty here. I remembered that jaybuff said that they plan 
 to fix that. Maybe you want to sync with them first.
 
 Niklas Nielsen wrote:
 Mind adding a few references? Where is the Termination proto?
 
 Jie Yu wrote:
 Some references for the context:
 https://issues.apache.org/jira/browse/MESOS-2035
 https://reviews.apache.org/r/33249/ (see discussions in the review)
 https://issues.apache.org/jira/browse/MESOS-2657
 
 Niklas Nielsen wrote:
 Can you be a bit more concrete about your concerns and suggested path 
 forward? Are you suggesting to land 2035 first? Is it the qosKilled bool that 
 concerns you?

My concern is that 'killed' field in Termination is overloaded. If it is true, 
it means either 1) user initiated killTask 2) killed due to container limit has 
been reached (e.g., memory, it's buggy already since we also have disk limit 
now). 3) killed due to QoS controller.

Therefore, relying on this field to decide what 'reason' and 'status' to send 
back is very unintuitive and hard to understand. Just feel that this is a tech 
debt we should fix asap.

I don't have a suggestion yet. I need to think about it.


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
---


On June 4, 2015, 5:43 p.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 4, 2015, 5:43 p.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Niklas Nielsen


 On June 4, 2015, 10:58 a.m., Vinod Kone wrote:
  src/slave/slave.cpp, lines 4400-4413
  https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400
 
  This is getting a little hairy. As the TODO says we really ought bubble 
  this up via the Termination protobuf. Have you looked into it?
 
 Jie Yu wrote:
 +1
 
 It really gets nasty here. I remembered that jaybuff said that they plan 
 to fix that. Maybe you want to sync with them first.
 
 Niklas Nielsen wrote:
 Mind adding a few references? Where is the Termination proto?
 
 Jie Yu wrote:
 Some references for the context:
 https://issues.apache.org/jira/browse/MESOS-2035
 https://reviews.apache.org/r/33249/ (see discussions in the review)
 https://issues.apache.org/jira/browse/MESOS-2657

Can you be a bit more concrete about your concerns and suggested path forward? 
Are you suggesting to land 2035 first? Is it the qosKilled bool that concerns 
you?


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
---


On June 4, 2015, 10:43 a.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 4, 2015, 10:43 a.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen
 




Re: Review Request 34720: Added kill executor correction to slave.

2015-06-04 Thread Niklas Nielsen


 On June 4, 2015, 10:58 a.m., Vinod Kone wrote:
  src/slave/slave.cpp, lines 4400-4413
  https://reviews.apache.org/r/34720/diff/3/?file=979257#file979257line4400
 
  This is getting a little hairy. As the TODO says we really ought bubble 
  this up via the Termination protobuf. Have you looked into it?
 
 Jie Yu wrote:
 +1
 
 It really gets nasty here. I remembered that jaybuff said that they plan 
 to fix that. Maybe you want to sync with them first.
 
 Niklas Nielsen wrote:
 Mind adding a few references? Where is the Termination proto?
 
 Jie Yu wrote:
 Some references for the context:
 https://issues.apache.org/jira/browse/MESOS-2035
 https://reviews.apache.org/r/33249/ (see discussions in the review)
 https://issues.apache.org/jira/browse/MESOS-2657
 
 Niklas Nielsen wrote:
 Can you be a bit more concrete about your concerns and suggested path 
 forward? Are you suggesting to land 2035 first? Is it the qosKilled bool that 
 concerns you?
 
 Jie Yu wrote:
 My concern is that 'killed' field in Termination is overloaded. If it is 
 true, it means either 1) user initiated killTask 2) killed due to container 
 limit has been reached (e.g., memory, it's buggy already since we also have 
 disk limit now). 3) killed due to QoS controller.
 
 Therefore, relying on this field to decide what 'reason' and 'status' to 
 send back is very unintuitive and hard to understand. Just feel that this is 
 a tech debt we should fix asap.
 
 I don't have a suggestion yet. I need to think about it.

There are a few things we need to address:

1) What status should QoS killed tasks have? Thought TASK_LOST is better than 
TASK_FAILED.
   Adding new terminal statuses requires support in frameworks and my intuition 
was that we should try to exercise the retry logic in the frameworks with 
TASK_LOST.

2) How should the framework be able tell the difference between different 
events.
   In this case, I thought it deserved it's own reason (as those are fine 
grained).

Didn't put more thought into this as I mimicked the existing code, so I am open 
to any change you guys have in mind.
While I agree with all the above, it seems like a larger issue with the 
containerizer API (and not well suited to be addressed here alone?)

 Therefore, relying on this field to decide what 'reason' and 'status' to send 
 back is very unintuitive and hard to understand. Just feel that this is a 
 tech debt we should fix asap.

This patch issues a containerizer-destroy() and this is where it ends; in 
order to distinguish the cause of kill, it looks like we need to change the 
containerizer API?


- Niklas


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34720/#review86669
---


On June 4, 2015, 10:43 a.m., Niklas Nielsen wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34720/
 ---
 
 (Updated June 4, 2015, 10:43 a.m.)
 
 
 Review request for mesos, Bartek Plotka, Jie Yu, Joris Van Remoortere, and 
 Vinod Kone.
 
 
 Bugs: MESOS-2653
 https://issues.apache.org/jira/browse/MESOS-2653
 
 
 Repository: mesos
 
 
 Description
 ---
 
 See summary
 
 
 Diffs
 -
 
   src/slave/slave.hpp 37e85af90275dc0c88c12b39a5813e7d71abb1f3 
   src/slave/slave.cpp 30e0d8ba2a2b12a0a7f1f7246b65360de30ef4ea 
 
 Diff: https://reviews.apache.org/r/34720/diff/
 
 
 Testing
 ---
 
 make check
 
 
 Thanks,
 
 Niklas Nielsen