Re: Review Request 50540: Add systemd watchdog support.

2017-03-01 Thread Joris Van Remoortere

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review167368
---



Closing this review due to inactivity. Please see our 
[guidelines](https://github.com/apache/mesos/blob/master/docs/reopening-reviews.md)
 for reopening reviews.

- Joris Van Remoortere


On Aug. 19, 2016, 12:09 a.m., Lawrence Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> ---
> 
> (Updated Aug. 19, 2016, 12:09 a.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
> https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/50540/diff/5/
> 
> 
> Testing
> ---
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit file.
> 
> 
> Thanks,
> 
> Lawrence Wu
> 
>



Re: Review Request 50540: Add systemd watchdog support.

2016-08-19 Thread Jie Yu


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > High level question: is this useful? I imagine watchdog logic for agent is 
> > to call sd_notify from `Slave` actor (similar to Master) so that we know 
> > `Slave` is still functional. I think this is the right way to detect 
> > hanging.
> > 
> > The systemd library code should provide primitive to the Agent code so that 
> > it does not call raw `sd_notify` directly.
> 
> Jie Yu wrote:
> Scratch that. I misunderstood what Watchdog is for. But in general, i 
> don't get why we need that. Sound like not a very effective way to detect 
> hanging.
> 
> Lawrence Wu wrote:
> We (Twitter) currently use monit to monitor mesos, but we'd like to use 
> the systemd watchdog to do the monitoring and detect hangs instead. Could you 
> elaborate on why you think it is not very effective? The way I see it, there 
> is very little extra code and performance overhead (sd_notify is a trivial 
> operation), and we gain reliability on the Mesos side.

For instance, the Slave actor is doing a LOG. Say the disk is slow and it takes 
a long time. Will this watchdog detect the hang?


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
---


On Aug. 19, 2016, 12:09 a.m., Lawrence Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> ---
> 
> (Updated Aug. 19, 2016, 12:09 a.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
> https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/50540/diff/
> 
> 
> Testing
> ---
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit file.
> 
> 
> Thanks,
> 
> Lawrence Wu
> 
>



Re: Review Request 50540: Add systemd watchdog support.

2016-08-19 Thread Jie Yu


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > src/linux/systemd.cpp, line 207
> > 
> >
> > Who will delete the Watchdog? I think you should use a `managed` spawn. 
> > See `spawn` definition.
> 
> Lawrence Wu wrote:
> Reading the definition for a managed spawn, it looks like we do not want 
> a managed process -- that will get garbage collected. We want the watchdog to 
> be alive for the entire lifetime of the process. I added a comment clarifying 
> this.

Can you read the code more? If the actor does not terminate, will it get 
garbaged collected?


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
---


On Aug. 19, 2016, 12:09 a.m., Lawrence Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> ---
> 
> (Updated Aug. 19, 2016, 12:09 a.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
> https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/50540/diff/
> 
> 
> Testing
> ---
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit file.
> 
> 
> Thanks,
> 
> Lawrence Wu
> 
>



Re: Review Request 50540: Add systemd watchdog support.

2016-08-19 Thread Jie Yu


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > configure.ac, lines 615-616
> > 
> >
> > Looks like you haven't checked headers. You need dev headers, right?
> 
> Lawrence Wu wrote:
> I'm not too sure, but I think we just need this lib for sd_notify. I 
> tried building this on a machine without the systemd dev headers and it 
> worked, so I don't think dev headers are necessary.

You mean `#include ` is not part of the dev headers? Can 
you make sure this is the case in all supported OS distribution? E.g., CentOS 
7, Ubuntu 12, 14, 16, Debian 8


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
---


On Aug. 19, 2016, 12:09 a.m., Lawrence Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> ---
> 
> (Updated Aug. 19, 2016, 12:09 a.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
> https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/50540/diff/
> 
> 
> Testing
> ---
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit file.
> 
> 
> Thanks,
> 
> Lawrence Wu
> 
>



Re: Review Request 50540: Add systemd watchdog support.

2016-08-18 Thread Lawrence Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/
---

(Updated Aug. 19, 2016, 12:09 a.m.)


Review request for mesos, David Robinson, Ian Downes, and Jie Yu.


Changes
---

use os::rm instead of subprocess unlink


Bugs: MESOS-5376
https://issues.apache.org/jira/browse/MESOS-5376


Repository: mesos


Description
---

Add systemd watchdog support.


Diffs (updated)
-

  configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
  src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
  src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
  src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
  src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
  src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
  src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
  src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
  src/tests/linux/systemd_tests.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/50540/diff/


Testing
---

Tested by sending SIGSTOP to running mesos and verifying via journalctl that it 
was killed by the watchdog.

The test I wrote for this does the following:
- build up a unit file as a string and create a unit file in 
/etc/systemd/system/systemd-test-helper.service
- reload the systemd daemon and start the newly discovered helper service
- wait a bit (30s) to make sure the watchdog has had a chance to kill the 
service
- use systemctl status systemd-test-helper to check that the service is still 
running
- clean up the unit file.


Thanks,

Lawrence Wu



Re: Review Request 50540: Add systemd watchdog support.

2016-08-18 Thread Lawrence Wu


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > High level question: is this useful? I imagine watchdog logic for agent is 
> > to call sd_notify from `Slave` actor (similar to Master) so that we know 
> > `Slave` is still functional. I think this is the right way to detect 
> > hanging.
> > 
> > The systemd library code should provide primitive to the Agent code so that 
> > it does not call raw `sd_notify` directly.
> 
> Jie Yu wrote:
> Scratch that. I misunderstood what Watchdog is for. But in general, i 
> don't get why we need that. Sound like not a very effective way to detect 
> hanging.

We (Twitter) currently use monit to monitor mesos, but we'd like to use the 
systemd watchdog to do the monitoring and detect hangs instead. Could you 
elaborate on why you think it is not very effective? The way I see it, there is 
very little extra code and performance overhead (sd_notify is a trivial 
operation), and we gain reliability on the Mesos side.


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > src/linux/systemd.cpp, line 207
> > 
> >
> > Who will delete the Watchdog? I think you should use a `managed` spawn. 
> > See `spawn` definition.

Reading the definition for a managed spawn, it looks like we do not want a 
managed process -- that will get garbage collected. We want the watchdog to be 
alive for the entire lifetime of the process. I added a comment clarifying this.


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > src/linux/systemd.cpp, line 208
> > 
> >
> > Why create the watch dog actor if WATCHDOG_USEC is not set?

this should be fixed now.


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > src/linux/systemd.hpp, line 32
> > 
> >
> > We now does not use `using namespace` any more. Please include them one 
> > by one using `using xxx;` clauses.

Done.


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > src/linux/systemd.cpp, line 203
> > 
> >
> > Hum, this also could be the case where WATCHDOG_USEC is not set, right? 
> > The error here sounds unintuitive.
> > 
> > I'd suggest we check env exists or not, and then parse it.
> > 
> > ```
> > if (watchdogUsec.isNone()) {
> > } else if (watchDog.isError()) {
> > }
> > ```

Done.


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > configure.ac, lines 615-616
> > 
> >
> > Looks like you haven't checked headers. You need dev headers, right?

I'm not too sure, but I think we just need this lib for sd_notify. I tried 
building this on a machine without the systemd dev headers and it worked, so I 
don't think dev headers are necessary.


- Lawrence


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
---


On Aug. 18, 2016, 11:19 p.m., Lawrence Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> ---
> 
> (Updated Aug. 18, 2016, 11:19 p.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
> https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/50540/diff/
> 
> 
> Testing
> ---
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit 

Re: Review Request 50540: Add systemd watchdog support.

2016-08-18 Thread Lawrence Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/
---

(Updated Aug. 18, 2016, 11:19 p.m.)


Review request for mesos, David Robinson, Ian Downes, and Jie Yu.


Changes
---

address issues, write second test


Bugs: MESOS-5376
https://issues.apache.org/jira/browse/MESOS-5376


Repository: mesos


Description
---

Add systemd watchdog support.


Diffs (updated)
-

  configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
  src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
  src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
  src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
  src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
  src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
  src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
  src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
  src/tests/linux/systemd_tests.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/50540/diff/


Testing (updated)
---

Tested by sending SIGSTOP to running mesos and verifying via journalctl that it 
was killed by the watchdog.

The test I wrote for this does the following:
- build up a unit file as a string and create a unit file in 
/etc/systemd/system/systemd-test-helper.service
- reload the systemd daemon and start the newly discovered helper service
- wait a bit (30s) to make sure the watchdog has had a chance to kill the 
service
- use systemctl status systemd-test-helper to check that the service is still 
running
- clean up the unit file.


Thanks,

Lawrence Wu



Re: Review Request 50540: Add systemd watchdog support.

2016-08-16 Thread Jie Yu


> On Aug. 17, 2016, 12:44 a.m., Jie Yu wrote:
> > High level question: is this useful? I imagine watchdog logic for agent is 
> > to call sd_notify from `Slave` actor (similar to Master) so that we know 
> > `Slave` is still functional. I think this is the right way to detect 
> > hanging.
> > 
> > The systemd library code should provide primitive to the Agent code so that 
> > it does not call raw `sd_notify` directly.

Scratch that. I misunderstood what Watchdog is for. But in general, i don't get 
why we need that. Sound like not a very effective way to detect hanging.


- Jie


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
---


On Aug. 15, 2016, 6:40 p.m., Lawrence Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> ---
> 
> (Updated Aug. 15, 2016, 6:40 p.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
> https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/50540/diff/
> 
> 
> Testing
> ---
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit file.
> 
> TODO: create a similar test, but send a SIGSTOP to the service and ensure 
> that it has been killed by watchdog.
> 
> 
> Thanks,
> 
> Lawrence Wu
> 
>



Re: Review Request 50540: Add systemd watchdog support.

2016-08-16 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review145876
---



High level question: is this useful? I imagine watchdog logic for agent is to 
call sd_notify from `Slave` actor (similar to Master) so that we know `Slave` 
is still functional. I think this is the right way to detect hanging.

The systemd library code should provide primitive to the Agent code so that it 
does not call raw `sd_notify` directly.


configure.ac (lines 615 - 616)


Looks like you haven't checked headers. You need dev headers, right?



src/linux/systemd.hpp (line 32)


We now does not use `using namespace` any more. Please include them one by 
one using `using xxx;` clauses.



src/linux/systemd.cpp (line 203)


Hum, this also could be the case where WATCHDOG_USEC is not set, right? The 
error here sounds unintuitive.

I'd suggest we check env exists or not, and then parse it.

```
if (watchdogUsec.isNone()) {
} else if (watchDog.isError()) {
}
```



src/linux/systemd.cpp (line 207)


Who will delete the Watchdog? I think you should use a `managed` spawn. See 
`spawn` definition.



src/linux/systemd.cpp (line 208)


Why create the watch dog actor if WATCHDOG_USEC is not set?


- Jie Yu


On Aug. 15, 2016, 6:40 p.m., Lawrence Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50540/
> ---
> 
> (Updated Aug. 15, 2016, 6:40 p.m.)
> 
> 
> Review request for mesos, David Robinson, Ian Downes, and Jie Yu.
> 
> 
> Bugs: MESOS-5376
> https://issues.apache.org/jira/browse/MESOS-5376
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add systemd watchdog support.
> 
> 
> Diffs
> -
> 
>   configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
>   src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
>   src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
>   src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
>   src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
>   src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
>   src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
>   src/tests/linux/systemd_tests.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/50540/diff/
> 
> 
> Testing
> ---
> 
> Tested by sending SIGSTOP to running mesos and verifying via journalctl that 
> it was killed by the watchdog.
> 
> The test I wrote for this does the following:
> - build up a unit file as a string and create a unit file in 
> /etc/systemd/system/systemd-test-helper.service
> - reload the systemd daemon and start the newly discovered helper service
> - wait a bit (30s) to make sure the watchdog has had a chance to kill the 
> service
> - use systemctl status systemd-test-helper to check that the service is still 
> running
> - clean up the unit file.
> 
> TODO: create a similar test, but send a SIGSTOP to the service and ensure 
> that it has been killed by watchdog.
> 
> 
> Thanks,
> 
> Lawrence Wu
> 
>



Re: Review Request 50540: Add systemd watchdog support.

2016-08-12 Thread Lawrence Wu


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/linux/systemd.hpp, line 36
> > 
> >
> > const reference?
> > 
> > Does the period need to change for different calls? Why not make it a 
> > static instance variable?

yes, let's make this an instance variable.


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/Makefile.am, line 2145
> > 
> >
> > Formatting should be 8 ts which I don't think this is. Use "set ts=8" 
> > in Vim.

fixed


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/tests/linux/systemd_tests.cpp, lines 122-123
> > 
> >
> > Does this validate that the service has actually been stopped or just 
> > that systemd accepted the stop command?

This just asserts that systemd has accepted the stop command and does not 
ensure that the service was actually stopped. I think it is safe to assume that 
systemd will correctly stop the service.


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/tests/linux/systemd_tests.cpp, line 126
> > 
> >
> > What happens if there's an assertion during the test!? Because this is 
> > outside the test sandbox it'll leak. See earlier comment about writing and 
> > referencing this from the sandbox.

see above, using systemctl link and disable now.


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/tests/linux/systemd_tests.cpp, line 44
> > 
> >
> > Is there any way to create the service files outside of the system 
> > directory to avoid polluting it (especially if the test crashes)?
> > 
> > And, systemd-test-helper is a pretty generic name...

I did originally try creating the service file in the test sandbox because the 
I thought I could run systemctl with SYSTEMD_UNIT_PATH set to the sandbox 
directory and then pick up my service file. Turns out you can't set 
SYSTEMD_UNIT_PATH on runtime -- it has to get set when compiling ("Unit files 
are loaded from a set of paths determined during compilation"). So this seemed 
like the only option.

I just investigated this some more, and you _can_ create the service file in 
the sandbox and use `systemctl link $file`, which will create a symlink in 
/etc/systemd/system/...which seems like almost as much pollution (symlink vs 
file), but is maybe a little better since if the test crashes and the user sees 
a symlinked file that points to nowhere, she can feel a little safer deleting 
the symlink. So let's go with this way.

Also renamed systemd-test-helper to mesos-systemd-test-helper.


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/tests/linux/systemd_test_helper.cpp, line 34
> > 
> >
> > Is this universally true across all distributions?

Actually, we don't use any of the flags in the test, so let's just not set them.


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/tests/linux/systemd_test_helper.cpp, line 35
> > 
> >
> > Can't assume cgroups are mounted here. Tests use TEST_CGROUPS_HIERARCHY?

not setting this anymore


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/linux/systemd.cpp, line 196
> > 
> >
> > Is there any behavior defined if the variable is set but with no value? 
> > As opposed to being unset?

systemd shouldn't set the variable improperly (with no value), but it's still 
possible that happens (or someone runs mesos with this envvar)...
I added some error handling here to catch unparseable variables.


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > configure.ac, lines 615-616
> > 
> >
> > What are the implications of linking this in for people that have 
> > systemd but don't want to manage Mesos with systemd? Should this be a 
> > configure option?

If users don't want to manage Mesos with systemd, they can set the 
systemd_enable_support flag to false. I don't think they should have to pass in 
another configure option for whether they want to use systemd or not -- let's 
just link the library if it exists and let the user disable systemd with the 
flag.


> On Aug. 9, 2016, 6:33 p.m., Ian Downes wrote:
> > src/tests/linux/systemd_tests.cpp, line 87
> > 
> >
> > no, no! bad... what can you set up to wait on to determine when it's 
> > been started?

Unfortunately, I think we have to sleep here because we are waiting on systemd 
(which is 

Re: Review Request 50540: Add systemd watchdog support.

2016-08-12 Thread Lawrence Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/
---

(Updated Aug. 12, 2016, 8:18 p.m.)


Review request for mesos, David Robinson and Ian Downes.


Changes
---

address feedback


Bugs: MESOS-5376
https://issues.apache.org/jira/browse/MESOS-5376


Repository: mesos


Description
---

Add systemd watchdog support.


Diffs (updated)
-

  configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
  src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
  src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
  src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
  src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
  src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
  src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
  src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
  src/tests/linux/systemd_tests.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/50540/diff/


Testing
---

Tested by sending SIGSTOP to running mesos and verifying via journalctl that it 
was killed by the watchdog.

The test I wrote for this does the following:
- build up a unit file as a string and create a unit file in 
/etc/systemd/system/systemd-test-helper.service
- reload the systemd daemon and start the newly discovered helper service
- wait a bit (30s) to make sure the watchdog has had a chance to kill the 
service
- use systemctl status systemd-test-helper to check that the service is still 
running
- clean up the unit file.

TODO: create a similar test, but send a SIGSTOP to the service and ensure that 
it has been killed by watchdog.


Thanks,

Lawrence Wu



Re: Review Request 50540: Add systemd watchdog support.

2016-08-09 Thread Ian Downes

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/#review144525
---




configure.ac (lines 615 - 616)


What are the implications of linking this in for people that have systemd 
but don't want to manage Mesos with systemd? Should this be a configure option?



src/Makefile.am (line 2145)


Formatting should be 8 ts which I don't think this is. Use "set ts=8" in 
Vim.



src/Makefile.am (line 2193)


Ditto, formatting.



src/linux/systemd.hpp (lines 20 - 21)


We alphabetize includes so process.hpp comes before subprocess.hpp. See the 
style guide.



src/linux/systemd.hpp (line 36)


const reference?

Does the period need to change for different calls? Why not make it a 
static instance variable?



src/linux/systemd.hpp (line 38)


Check style, I think it's to include comment on close of #ifdef, e.g.,

#endif // HAVE_LIBSYSTEMD



src/linux/systemd.cpp (line 31)


Do you need this conditional compilation *within* the linux/systemd.cpp 
file, or is inclusion of the entire file conditional?



src/linux/systemd.cpp (line 195)


ditto



src/linux/systemd.cpp (line 196)


Is there any behavior defined if the variable is set but with no value? As 
opposed to being unset?



src/linux/systemd.cpp (line 199)


use os::getenv()



src/linux/systemd.cpp (line 200)


use numify<>()



src/linux/systemd.cpp (line 201)


Watchdog* watchdog

What's the lifetime of this object!? Who (and when) releases it?



src/linux/systemd.cpp (line 212)


ditto



src/linux/systemd.cpp (line 213)


This would be const& but see earlier.



src/tests/linux/systemd_test_helper.cpp (lines 19 - 21)


split these



src/tests/linux/systemd_test_helper.cpp (line 34)


Is this universally true across all distributions?



src/tests/linux/systemd_test_helper.cpp (line 35)


Can't assume cgroups are mounted here. Tests use TEST_CGROUPS_HIERARCHY?



src/tests/linux/systemd_tests.cpp (lines 17 - 22)


alphabetize please.



src/tests/linux/systemd_tests.cpp (line 44)


Is there any way to create the service files outside of the system 
directory to avoid polluting it (especially if the test crashes)?

And, systemd-test-helper is a pretty generic name...



src/tests/linux/systemd_tests.cpp (line 45)


import this and std::cout etc.



src/tests/linux/systemd_tests.cpp (line 47)


std::endl on preceeding line?



src/tests/linux/systemd_tests.cpp (lines 65 - 68)


Use stout's os::write(path, message).



src/tests/linux/systemd_tests.cpp (line 71)


Don't use informational logging, use assertions with logging:
ASSERT(false) << "Ooops!";



src/tests/linux/systemd_tests.cpp (line 79)


ditto



src/tests/linux/systemd_tests.cpp (line 87)


no, no! bad... what can you set up to wait on to determine when it's been 
started?



src/tests/linux/systemd_tests.cpp (line 89)


Use path::join()



src/tests/linux/systemd_tests.cpp (line 90)


ditto



src/tests/linux/systemd_tests.cpp (lines 101 - 109)


Use os::read(tmpout) to get a Result and avoid all of this.



src/tests/linux/systemd_tests.cpp (lines 112 - 113)


Use stout's `bool strings::contains()`



src/tests/linux/systemd_tests.cpp (line 116)


ditto




Re: Review Request 50540: Add systemd watchdog support.

2016-07-27 Thread Lawrence Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/
---

(Updated July 27, 2016, 11:38 p.m.)


Review request for mesos, David Robinson and Ian Downes.


Changes
---

remove some dead code


Bugs: MESOS-5376
https://issues.apache.org/jira/browse/MESOS-5376


Repository: mesos


Description (updated)
---

Add systemd watchdog support.


Diffs (updated)
-

  configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
  src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
  src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
  src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
  src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
  src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
  src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
  src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
  src/tests/linux/systemd_tests.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/50540/diff/


Testing
---

Tested by sending SIGSTOP to running mesos and verifying via journalctl that it 
was killed by the watchdog.

The test I wrote for this does the following:
- build up a unit file as a string and create a unit file in 
/etc/systemd/system/systemd-test-helper.service
- reload the systemd daemon and start the newly discovered helper service
- wait a bit (30s) to make sure the watchdog has had a chance to kill the 
service
- use systemctl status systemd-test-helper to check that the service is still 
running
- clean up the unit file.

TODO: create a similar test, but send a SIGSTOP to the service and ensure that 
it has been killed by watchdog.


Thanks,

Lawrence Wu



Review Request 50540: Add systemd watchdog support.

2016-07-27 Thread Lawrence Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50540/
---

Review request for mesos, David Robinson and Ian Downes.


Bugs: MESOS-5376
https://issues.apache.org/jira/browse/MESOS-5376


Repository: mesos


Description
---

Adds systemd watchdog support (see 
http://0pointer.de/blog/projects/watchdog.html for context).


Diffs
-

  configure.ac d2136909b7305498ae901a5ea00133142b77f9e6 
  src/Makefile.am 599ebbef6d164fb2a530b55427ddabb5cd607634 
  src/linux/systemd.hpp 91134f1d4b100759e45931bd09ca4e1e1aeaaf8a 
  src/linux/systemd.cpp 619aa2778da5f99d3a078a8e1208bdaa9dc77581 
  src/slave/main.cpp 4624392d30cf391015dcd63f447fe2414a47a16a 
  src/tests/linux/systemd-test.service PRE-CREATION 
  src/tests/linux/systemd_test_helper.hpp PRE-CREATION 
  src/tests/linux/systemd_test_helper.cpp PRE-CREATION 
  src/tests/linux/systemd_test_helper_main.cpp PRE-CREATION 
  src/tests/linux/systemd_tests.cpp PRE-CREATION 

Diff: https://reviews.apache.org/r/50540/diff/


Testing
---

Tested by sending SIGSTOP to running mesos and verifying via journalctl that it 
was killed by the watchdog.

The test I wrote for this does the following:
- build up a unit file as a string and create a unit file in 
/etc/systemd/system/systemd-test-helper.service
- reload the systemd daemon and start the newly discovered helper service
- wait a bit (30s) to make sure the watchdog has had a chance to kill the 
service
- use systemctl status systemd-test-helper to check that the service is still 
running
- clean up the unit file.

TODO: create a similar test, but send a SIGSTOP to the service and ensure that 
it has been killed by watchdog.


Thanks,

Lawrence Wu