Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-28 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/
---

(Updated Nov. 28, 2018, 6:13 p.m.)


Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, and 
Till Toenshoff.


Changes
---

Addressed AlexR's comment.


Bugs: MESOS-9419
https://issues.apache.org/jira/browse/MESOS-9419


Repository: mesos


Description
---

The `Framework::send` function assumes that either `http` or `pid` is
set, which is not true for a framework that hasn't yet reregistered yet
but recovered from a reregistered agent. As a result, the master would
crash when a recovered executor tries to send a message to such a
framework (see MESOS-9419). This patch fixes this crash bug.


Diffs (updated)
-

  src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
  src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 


Diff: https://reviews.apache.org/r/69451/diff/4/

Changes: https://reviews.apache.org/r/69451/diff/3-4/


Testing
---

make check


Thanks,

Chun-Hung Hsiao



Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-28 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/#review210931
---


Ship it!




Ship It!

- Benjamin Mahler


On Nov. 28, 2018, 12:40 a.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69451/
> ---
> 
> (Updated Nov. 28, 2018, 12:40 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, 
> and Till Toenshoff.
> 
> 
> Bugs: MESOS-9419
> https://issues.apache.org/jira/browse/MESOS-9419
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `Framework::send` function assumes that either `http` or `pid` is
> set, which is not true for a framework that hasn't yet reregistered yet
> but recovered from a reregistered agent. As a result, the master would
> crash when a recovered executor tries to send a message to such a
> framework (see MESOS-9419). This patch fixes this crash bug.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
>   src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 
> 
> 
> Diff: https://reviews.apache.org/r/69451/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-28 Thread Alexander Rukletsov

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/#review210928
---




src/master/master.hpp
Line 2609 (original), 2621 (patched)


Does it make sense to add a log line in case neither `http` not `pid` are 
set?


- Alexander Rukletsov


On Nov. 28, 2018, 12:40 a.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69451/
> ---
> 
> (Updated Nov. 28, 2018, 12:40 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, 
> and Till Toenshoff.
> 
> 
> Bugs: MESOS-9419
> https://issues.apache.org/jira/browse/MESOS-9419
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `Framework::send` function assumes that either `http` or `pid` is
> set, which is not true for a framework that hasn't yet reregistered yet
> but recovered from a reregistered agent. As a result, the master would
> crash when a recovered executor tries to send a message to such a
> framework (see MESOS-9419). This patch fixes this crash bug.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
>   src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 
> 
> 
> Diff: https://reviews.apache.org/r/69451/diff/3/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-27 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/
---

(Updated Nov. 28, 2018, 12:40 a.m.)


Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, and 
Till Toenshoff.


Changes
---

Addressed BenM's comment.


Bugs: MESOS-9419
https://issues.apache.org/jira/browse/MESOS-9419


Repository: mesos


Description
---

The `Framework::send` function assumes that either `http` or `pid` is
set, which is not true for a framework that hasn't yet reregistered yet
but recovered from a reregistered agent. As a result, the master would
crash when a recovered executor tries to send a message to such a
framework (see MESOS-9419). This patch fixes this crash bug.


Diffs (updated)
-

  src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
  src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 


Diff: https://reviews.apache.org/r/69451/diff/3/

Changes: https://reviews.apache.org/r/69451/diff/2-3/


Testing
---

make check


Thanks,

Chun-Hung Hsiao



Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-27 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/#review210914
---




src/master/master.hpp
Lines 2594-2609 (original), 2596-2619 (patched)


How about:

```
  if (!connected()) {
LOG(WARNING) << "Master attempting to send message to "
 << (recovered() ? "recovered" : "disconnected")
 << " framework " << *this;
 
// NOTE: We proceed here without returning to support the case where a
// `disconnected()` framework is still talking to the master and the
// master wants to shut it down by sending a `FrameworkErrorMessage`.
// This can occur in a one way link breakage where the master ->
// framework link is broken but the framework -> master link remains
// intact. Note that we don't have periodic heartbeating between master
// and pid-based schedulers.
//
// TODO(cshiao): Update the `FrameworkErrorMessage` call-sites that
// rely on the lack of a `return` here to directly call 
`process::send()`
// so that this function doesn't need to deal with the special case.
// Then we can check that if we're connected -> one of `http` or `pid`
// is set.
  }
  
  if (http.isSome()) {
if (!http->send(message)) {
  LOG(WARNING) << "Unable to send event to framework " << *this << ":"
   << " connection closed";
}
  } else if (pid.isSome()) {
master->send(pid.get(), message);
  }
```


- Benjamin Mahler


On Nov. 27, 2018, 11:01 p.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69451/
> ---
> 
> (Updated Nov. 27, 2018, 11:01 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, 
> and Till Toenshoff.
> 
> 
> Bugs: MESOS-9419
> https://issues.apache.org/jira/browse/MESOS-9419
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `Framework::send` function assumes that either `http` or `pid` is
> set, which is not true for a framework that hasn't yet reregistered yet
> but recovered from a reregistered agent. As a result, the master would
> crash when a recovered executor tries to send a message to such a
> framework (see MESOS-9419). This patch fixes this crash bug.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
>   src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 
> 
> 
> Diff: https://reviews.apache.org/r/69451/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-27 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/
---

(Updated Nov. 27, 2018, 11:01 p.m.)


Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, and 
Till Toenshoff.


Changes
---

Addressed BenM's comments.


Bugs: MESOS-9419
https://issues.apache.org/jira/browse/MESOS-9419


Repository: mesos


Description (updated)
---

The `Framework::send` function assumes that either `http` or `pid` is
set, which is not true for a framework that hasn't yet reregistered yet
but recovered from a reregistered agent. As a result, the master would
crash when a recovered executor tries to send a message to such a
framework (see MESOS-9419). This patch fixes this crash bug.


Diffs (updated)
-

  src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
  src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 


Diff: https://reviews.apache.org/r/69451/diff/2/

Changes: https://reviews.apache.org/r/69451/diff/1-2/


Testing
---

make check


Thanks,

Chun-Hung Hsiao



Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-27 Thread Chun-Hung Hsiao


> On Nov. 27, 2018, 9:02 p.m., Benjamin Mahler wrote:
> > src/master/master.hpp
> > Line 2599 (original), 2600-2604 (patched)
> > 
> >
> > Hm.. based on this comment, should we only be sending in the 
> > disconnected+pid case?
> > 
> > Also, which messages is this referring to? It seems to suggest we let 
> > all messages through, but in the call sites we seem to guard this call to 
> > avoid calling it for the disconnected case?

The "message" refers to the `message` parameter of this function.

I was trying to avoid calling out what messages apply to this special case in 
this function, to avoid a misleading comment if it becomes outdated in the 
future.


> On Nov. 27, 2018, 9:02 p.m., Benjamin Mahler wrote:
> > src/master/master.hpp
> > Line 2609 (original), 2614 (patched)
> > 
> >
> > Do we want a similar log warning here for the else case?
> > 
> > ```
> > } else {
> >   LOG(WARNING) << "Unable to send event to framework " << *this << ":"
> ><< " framework is recovered but hasn't re-registered";
> > }
> > ```
> > 
> > Is this accurate or are there other cases? My read of the http/pid 
> > state comment is that it will be set based on the last connection, 
> > therefore the only time one of them won't be set is the recovered case (the 
> > change that broke the original one-being-set invariant).

Yes I believe this is the case. Let me add the warning.


- Chun-Hung


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/#review210904
---


On Nov. 27, 2018, 4:58 a.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69451/
> ---
> 
> (Updated Nov. 27, 2018, 4:58 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, 
> and Till Toenshoff.
> 
> 
> Bugs: MESOS-9419
> https://issues.apache.org/jira/browse/MESOS-9419
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `Framework::send` function assumes that either `http` or `pid` is
> set, which is not true for a unregistered framework recovered from agent
> reregistration. As a result, the master would crash when a recovered
> executor tries to send a message to such a framework. This patch fixes
> this crash bug.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
>   src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 
> 
> 
> Diff: https://reviews.apache.org/r/69451/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-27 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/#review210904
---



Thanks for jumping on this and producing a fix! Some small edit suggestions to 
the description:

```
The Framework::send function assumes that either http or pid is
set, which is not true for a unregistered framework recovered from agent  // 
s/unregistered//
reregistration. As a result, the master would crash when a recovered  // 
s/reregistration/that hasn't yet re-registered/
executor tries to send a message to such a framework. This patch fixes// 
s/framework/framework (see MESOS-).
this crash bug.
```


src/master/master.hpp
Line 2595 (original), 2597 (patched)


attempting?



src/master/master.hpp
Line 2599 (original), 2600-2604 (patched)


Hm.. based on this comment, should we only be sending in the 
disconnected+pid case?

Also, which messages is this referring to? It seems to suggest we let all 
messages through, but in the call sites we seem to guard this call to avoid 
calling it for the disconnected case?



src/master/master.hpp
Line 2609 (original), 2614 (patched)


Do we want a similar log warning here for the else case?

```
} else {
  LOG(WARNING) << "Unable to send event to framework " << *this << ":"
   << " framework is recovered but hasn't re-registered";
}
```

Is this accurate or are there other cases? My read of the http/pid state 
comment is that it will be set based on the last connection, therefore the only 
time one of them won't be set is the recovered case (the change that broke the 
original one-being-set invariant).


- Benjamin Mahler


On Nov. 27, 2018, 4:58 a.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69451/
> ---
> 
> (Updated Nov. 27, 2018, 4:58 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, 
> and Till Toenshoff.
> 
> 
> Bugs: MESOS-9419
> https://issues.apache.org/jira/browse/MESOS-9419
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `Framework::send` function assumes that either `http` or `pid` is
> set, which is not true for a unregistered framework recovered from agent
> reregistration. As a result, the master would crash when a recovered
> executor tries to send a message to such a framework. This patch fixes
> this crash bug.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
>   src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 
> 
> 
> Diff: https://reviews.apache.org/r/69451/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Review Request 69451: Fixed master crash when executors send messages to recovered frameworks.

2018-11-26 Thread Chun-Hung Hsiao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69451/
---

Review request for mesos, Alexander Rukletsov, Benjamin Mahler, Greg Mann, and 
Till Toenshoff.


Bugs: MESOS-9419
https://issues.apache.org/jira/browse/MESOS-9419


Repository: mesos


Description
---

The `Framework::send` function assumes that either `http` or `pid` is
set, which is not true for a unregistered framework recovered from agent
reregistration. As a result, the master would crash when a recovered
executor tries to send a message to such a framework. This patch fixes
this crash bug.


Diffs
-

  src/master/master.hpp 3b3c1a4e61de9503c8d038dd3bee623ded5914c9 
  src/master/master.cpp b4b02d8b4d7d6d1aabda1f97b9bf824419f76a9e 


Diff: https://reviews.apache.org/r/69451/diff/1/


Testing
---

make check


Thanks,

Chun-Hung Hsiao