Re: Review Request 26383: Health Check Disabler

2014-10-17 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 17, 2014, 10:09 a.m.)


Review request for Aurora, Joe Smith, Maxim Khutornenko, Bill Farner, and 
Zameer Manji.


Changes
---

+wfarner


Bugs: AURORA-795
https://issues.apache.org/jira/browse/AURORA-795


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs
-

  docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-17 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review57163
---



src/main/python/apache/aurora/executor/common/health_checker.py
https://reviews.apache.org/r/26383/#comment97643

This is shorter: 

if self.sandbox and self.sandbox.exists():



src/main/python/apache/aurora/executor/common/health_checker.py
https://reviews.apache.org/r/26383/#comment97644

same here: if self.snooze_file and ...



src/main/python/apache/aurora/executor/common/health_checker.py
https://reviews.apache.org/r/26383/#comment97656

How about Health check snooze file found at: path...?



src/test/python/apache/aurora/executor/common/test_health_checker.py
https://reviews.apache.org/r/26383/#comment97662

Hm... Does it pass isort-check? You can run 
./build-support/hooks/pre-commit to verify.


- Maxim Khutornenko


On Oct. 17, 2014, 5:09 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 17, 2014, 5:09 p.m.)
 
 
 Review request for Aurora, Joe Smith, Maxim Khutornenko, Bill Farner, and 
 Zameer Manji.
 
 
 Bugs: AURORA-795
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-17 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 17, 2014, 11:28 a.m.)


Review request for Aurora, Joe Smith, Maxim Khutornenko, Bill Farner, and 
Zameer Manji.


Changes
---

Made changes suggested by Maxim.


Bugs: AURORA-795
https://issues.apache.org/jira/browse/AURORA-795


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs (updated)
-

  docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-17 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review57197
---

Ship it!


Ship It!

- Maxim Khutornenko


On Oct. 17, 2014, 6:28 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 17, 2014, 6:28 p.m.)
 
 
 Review request for Aurora, Joe Smith, Maxim Khutornenko, Bill Farner, and 
 Zameer Manji.
 
 
 Bugs: AURORA-795
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-17 Thread David Pan


 On Oct. 17, 2014, 12:18 p.m., Maxim Khutornenko wrote:
  Ship It!

Bill Farner mentioned that once you give a shipit, it is good to go (and that 
he is not needed for the review).  Is it okay if you commit the change?

Thanks!


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review57197
---


On Oct. 17, 2014, 12:26 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 17, 2014, 12:26 p.m.)
 
 
 Review request for Aurora, Joe Smith, Maxim Khutornenko, and Zameer Manji.
 
 
 Bugs: AURORA-795
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-17 Thread Maxim Khutornenko


 On Oct. 17, 2014, 7:30 p.m., David Pan wrote:
  Ship It!

This is on master. Thanks!


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review57199
---


On Oct. 17, 2014, 7:26 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 17, 2014, 7:26 p.m.)
 
 
 Review request for Aurora, Joe Smith, Maxim Khutornenko, and Zameer Manji.
 
 
 Bugs: AURORA-795
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-16 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 16, 2014, 9:47 a.m.)


Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Changes
---

Added Snooze Health Check section to user guide.  Minor modifications to mocks.


Bugs: AURORA-795
https://issues.apache.org/jira/browse/AURORA-795


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs (updated)
-

  docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-16 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 16, 2014, 3 p.m.)


Review request for Aurora, Joe Smith, Maxim Khutornenko, and Zameer Manji.


Changes
---

+maxim, -wickman


Bugs: AURORA-795
https://issues.apache.org/jira/browse/AURORA-795


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs
-

  docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-16 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review57019
---

Ship it!


Ship It!

- Zameer Manji


On Oct. 16, 2014, 3:04 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 16, 2014, 3:04 p.m.)
 
 
 Review request for Aurora, Joe Smith, Maxim Khutornenko, and Zameer Manji.
 
 
 Bugs: AURORA-795
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/user-guide.md e12ee89b043e34d06348ae3ddbbe8363244fca7c 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-13 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 13, 2014, 6:22 p.m.)


Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Changes
---

Added bug.

(-wfarner)


Bugs: AURORA-795
https://issues.apache.org/jira/browse/AURORA-795


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs
-

  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-13 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 13, 2014, 12:27 p.m.)


Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Changes
---

Removed time.


Bugs: AURORA-795
https://issues.apache.org/jira/browse/AURORA-795


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs (updated)
-

  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-13 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 13, 2014, 1:53 p.m.)


Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Changes
---

Change to log.info for Health check snooze file found ...


Bugs: AURORA-795
https://issues.apache.org/jira/browse/AURORA-795


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs (updated)
-

  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-10 Thread David Pan


 On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.
 
 Zameer Manji wrote:
 I'm very worried about introducing a feature which can allow unhealthy 
 instances live forever. Furthermore this information isn't exposed on the 
 Aurora UI or Observer UI so there isn't an easy way to check if an instance 
 has health checking disabled or not. A user can be completely oblivious that 
 health checking of their instance has been disabled.
 
 David Pan wrote:
 The motivation behind the health check disabler is for another project I 
 am working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
 Dumper needs to disable health checks for the task that is being heap dumped. 
  Under normal circumstances, the heap dumper can re-enable the health checks. 
  However, if the heap dumper dies unexpectedly, the health checks will remain 
 disabled forever if we don't implement time control.  We don't want the 
 disabling and enabling of health checks to be a manual process.
 
 Kevin Sweeney wrote:
 I still agree with Bill here - the simpler implementation (skip health 
 check if the file exists) is easier to reason about and a cleaner API.
 
 A process running in the sandbox could implement some form of timeout 
 logic itself (for example, the application could expose an endpoint that 
 forks the equivalent of touch .healthchecksnooze; sleep 60; rm 
 .healthchecksnooze) but I like the idea of keeping the executor API here 
 simple to explain and reason about.
 
 David Pan wrote:
 Overall, wouldn't it be simpler for the executor API to handle the time 
 control, rather than requiring an identical process in every task's sandbox 
 to handle the time control?
 
 Joshua Cohen wrote:
 I feel like we should err on the side of correctness here, rather than 
 simplicity? The dangers of someone accidentally leaving health checks 
 disabled indefinitely (on a service that has opted in to health checks) are 
 not insignificant.

I feel like this discussion warrants a short meeting.  I'll set something up on 
the calendar.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 9, 2014, 1:56 a.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-10 Thread Bill Farner


 On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.
 
 Zameer Manji wrote:
 I'm very worried about introducing a feature which can allow unhealthy 
 instances live forever. Furthermore this information isn't exposed on the 
 Aurora UI or Observer UI so there isn't an easy way to check if an instance 
 has health checking disabled or not. A user can be completely oblivious that 
 health checking of their instance has been disabled.
 
 David Pan wrote:
 The motivation behind the health check disabler is for another project I 
 am working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
 Dumper needs to disable health checks for the task that is being heap dumped. 
  Under normal circumstances, the heap dumper can re-enable the health checks. 
  However, if the heap dumper dies unexpectedly, the health checks will remain 
 disabled forever if we don't implement time control.  We don't want the 
 disabling and enabling of health checks to be a manual process.
 
 Kevin Sweeney wrote:
 I still agree with Bill here - the simpler implementation (skip health 
 check if the file exists) is easier to reason about and a cleaner API.
 
 A process running in the sandbox could implement some form of timeout 
 logic itself (for example, the application could expose an endpoint that 
 forks the equivalent of touch .healthchecksnooze; sleep 60; rm 
 .healthchecksnooze) but I like the idea of keeping the executor API here 
 simple to explain and reason about.
 
 David Pan wrote:
 Overall, wouldn't it be simpler for the executor API to handle the time 
 control, rather than requiring an identical process in every task's sandbox 
 to handle the time control?
 
 Joshua Cohen wrote:
 I feel like we should err on the side of correctness here, rather than 
 simplicity? The dangers of someone accidentally leaving health checks 
 disabled indefinitely (on a service that has opted in to health checks) are 
 not insignificant.
 
 David Pan wrote:
 I feel like this discussion warrants a short meeting.  I'll set something 
 up on the calendar.

David - as an open source project, the decision should not happen behind closed 
doors.  An e-mail thread to dev@ is the best course from here.  I'd be happy to 
make the call, but it's always nice to give other intersted parties a chance to 
chime in.


- Bill


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 9, 2014, 1:56 a.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-10 Thread David Pan


 On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.
 
 Zameer Manji wrote:
 I'm very worried about introducing a feature which can allow unhealthy 
 instances live forever. Furthermore this information isn't exposed on the 
 Aurora UI or Observer UI so there isn't an easy way to check if an instance 
 has health checking disabled or not. A user can be completely oblivious that 
 health checking of their instance has been disabled.
 
 David Pan wrote:
 The motivation behind the health check disabler is for another project I 
 am working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
 Dumper needs to disable health checks for the task that is being heap dumped. 
  Under normal circumstances, the heap dumper can re-enable the health checks. 
  However, if the heap dumper dies unexpectedly, the health checks will remain 
 disabled forever if we don't implement time control.  We don't want the 
 disabling and enabling of health checks to be a manual process.
 
 Kevin Sweeney wrote:
 I still agree with Bill here - the simpler implementation (skip health 
 check if the file exists) is easier to reason about and a cleaner API.
 
 A process running in the sandbox could implement some form of timeout 
 logic itself (for example, the application could expose an endpoint that 
 forks the equivalent of touch .healthchecksnooze; sleep 60; rm 
 .healthchecksnooze) but I like the idea of keeping the executor API here 
 simple to explain and reason about.
 
 David Pan wrote:
 Overall, wouldn't it be simpler for the executor API to handle the time 
 control, rather than requiring an identical process in every task's sandbox 
 to handle the time control?
 
 Joshua Cohen wrote:
 I feel like we should err on the side of correctness here, rather than 
 simplicity? The dangers of someone accidentally leaving health checks 
 disabled indefinitely (on a service that has opted in to health checks) are 
 not insignificant.
 
 David Pan wrote:
 I feel like this discussion warrants a short meeting.  I'll set something 
 up on the calendar.
 
 Bill Farner wrote:
 David - as an open source project, the decision should not happen behind 
 closed doors.  An e-mail thread to dev@ is the best course from here.  I'd be 
 happy to make the call, but it's always nice to give other intersted parties 
 a chance to chime in.

Bill - Ok, I understand.  So should I send an email to dev@ briefing the 
situation (with references to this code review), and invite interested parties 
to chime in?


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 9, 2014, 1:56 a.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread Bill Farner

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---



src/main/python/apache/aurora/executor/common/health_checker.py
https://reviews.apache.org/r/26383/#comment96330

FWIW i actually meant to suggest that the snooze has no concept of time at 
all.  If the file is there, don't perform health checks.  When you want to 
re-enable health checks, delete the file.  Happy to hear what others think 
about that.


- Bill Farner


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 9, 2014, 1:56 a.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread Zameer Manji


 On Oct. 9, 2014, 7:53 a.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.

Doesn't this mean user error can disable health checks forever? I think we 
should treat disabling health checking as an exceptional state (since the 
proceses has opted in to health checking) and therefore require user action 
(increasing mtime) to continue to stay in this state.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 8, 2014, 6:56 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 8, 2014, 6:56 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread Kevin Sweeney


 On Oct. 9, 2014, 7:53 a.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.

Presumably we can trust the user here - health checking is after all opt-in via 
{{thermos.ports[health]}}. Making it a simple on/off switch greatly simplifies 
the code on our end.


- Kevin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 8, 2014, 6:56 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 8, 2014, 6:56 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread Zameer Manji


 On Oct. 9, 2014, 7:53 a.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.

I'm very worried about introducing a feature which can allow unhealthy 
instances live forever. Furthermore this information isn't exposed on the 
Aurora UI or Observer UI so there isn't an easy way to check if an instance has 
health checking disabled or not. A user can be completely oblivious that health 
checking of their instance has been disabled.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 8, 2014, 6:56 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 8, 2014, 6:56 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread David Pan


 On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.
 
 Zameer Manji wrote:
 I'm very worried about introducing a feature which can allow unhealthy 
 instances live forever. Furthermore this information isn't exposed on the 
 Aurora UI or Observer UI so there isn't an easy way to check if an instance 
 has health checking disabled or not. A user can be completely oblivious that 
 health checking of their instance has been disabled.

The motivation behind the health check disabler is for another project I am 
working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
Dumper needs to disable health checks for the task that is being heap dumped.  
Under normal circumstances, the heap dumper can re-enable the health checks.  
However, if the heap dumper dies unexpectedly, the health checks will remain 
disabled forever if we don't implement time control.  We don't want the 
disabling and enabling of health checks to be a manual process.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 9, 2014, 1:56 a.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread Kevin Sweeney


 On Oct. 9, 2014, 7:53 a.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.
 
 Zameer Manji wrote:
 I'm very worried about introducing a feature which can allow unhealthy 
 instances live forever. Furthermore this information isn't exposed on the 
 Aurora UI or Observer UI so there isn't an easy way to check if an instance 
 has health checking disabled or not. A user can be completely oblivious that 
 health checking of their instance has been disabled.
 
 David Pan wrote:
 The motivation behind the health check disabler is for another project I 
 am working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
 Dumper needs to disable health checks for the task that is being heap dumped. 
  Under normal circumstances, the heap dumper can re-enable the health checks. 
  However, if the heap dumper dies unexpectedly, the health checks will remain 
 disabled forever if we don't implement time control.  We don't want the 
 disabling and enabling of health checks to be a manual process.

I still agree with Bill here - the simpler implementation (skip health check if 
the file exists) is easier to reason about and a cleaner API.

A process running in the sandbox could implement some form of timeout logic 
itself (for example, the application could expose an endpoint that forks the 
equivalent of touch .healthchecksnooze; sleep 60; rm .healthchecksnooze) but 
I like the idea of keeping the executor API here simple to explain and reason 
about.


- Kevin


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 8, 2014, 6:56 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 8, 2014, 6:56 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread David Pan


 On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.
 
 Zameer Manji wrote:
 I'm very worried about introducing a feature which can allow unhealthy 
 instances live forever. Furthermore this information isn't exposed on the 
 Aurora UI or Observer UI so there isn't an easy way to check if an instance 
 has health checking disabled or not. A user can be completely oblivious that 
 health checking of their instance has been disabled.
 
 David Pan wrote:
 The motivation behind the health check disabler is for another project I 
 am working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
 Dumper needs to disable health checks for the task that is being heap dumped. 
  Under normal circumstances, the heap dumper can re-enable the health checks. 
  However, if the heap dumper dies unexpectedly, the health checks will remain 
 disabled forever if we don't implement time control.  We don't want the 
 disabling and enabling of health checks to be a manual process.
 
 Kevin Sweeney wrote:
 I still agree with Bill here - the simpler implementation (skip health 
 check if the file exists) is easier to reason about and a cleaner API.
 
 A process running in the sandbox could implement some form of timeout 
 logic itself (for example, the application could expose an endpoint that 
 forks the equivalent of touch .healthchecksnooze; sleep 60; rm 
 .healthchecksnooze) but I like the idea of keeping the executor API here 
 simple to explain and reason about.

Overall, wouldn't it be simpler for the executor API to handle the time 
control, rather than requiring an identical process in every task's sandbox to 
handle the time control?


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 9, 2014, 1:56 a.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread Joshua Cohen
I feel like we should err on the side of correctness here, rather than
simplicity? The dangers of someone accidentally leaving health checks
disabled indefinitely (on a service that has opted in to health checks) are
not insignificant.

On Thu, Oct 9, 2014 at 1:34 PM, Kevin Sweeney kevi...@apache.org wrote:



  On Oct. 9, 2014, 7:53 a.m., Bill Farner wrote:
   src/main/python/apache/aurora/executor/common/health_checker.py, line
 66
   
 https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
  
   FWIW i actually meant to suggest that the snooze has no concept of
 time at all.  If the file is there, don't perform health checks.  When you
 want to re-enable health checks, delete the file.  Happy to hear what
 others think about that.
 
  Zameer Manji wrote:
  Doesn't this mean user error can disable health checks forever? I
 think we should treat disabling health checking as an exceptional state
 (since the proceses has opted in to health checking) and therefore require
 user action (increasing mtime) to continue to stay in this state.
 
  Kevin Sweeney wrote:
  Presumably we can trust the user here - health checking is after all
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch
 greatly simplifies the code on our end.
 
  Zameer Manji wrote:
  I'm very worried about introducing a feature which can allow
 unhealthy instances live forever. Furthermore this information isn't
 exposed on the Aurora UI or Observer UI so there isn't an easy way to check
 if an instance has health checking disabled or not. A user can be
 completely oblivious that health checking of their instance has been
 disabled.
 
  David Pan wrote:
  The motivation behind the health check disabler is for another
 project I am working on, which is the JVM Heap Dumper.  Long story short,
 the JVM Heap Dumper needs to disable health checks for the task that is
 being heap dumped.  Under normal circumstances, the heap dumper can
 re-enable the health checks.  However, if the heap dumper dies
 unexpectedly, the health checks will remain disabled forever if we don't
 implement time control.  We don't want the disabling and enabling of health
 checks to be a manual process.

 I still agree with Bill here - the simpler implementation (skip health
 check if the file exists) is easier to reason about and a cleaner API.

 A process running in the sandbox could implement some form of timeout
 logic itself (for example, the application could expose an endpoint that
 forks the equivalent of touch .healthchecksnooze; sleep 60; rm
 .healthchecksnooze) but I like the idea of keeping the executor API here
 simple to explain and reason about.


 - Kevin


 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/#review55985
 ---


 On Oct. 8, 2014, 6:56 p.m., David Pan wrote:
 
  ---
  This is an automatically generated e-mail. To reply, visit:
  https://reviews.apache.org/r/26383/
  ---
 
  (Updated Oct. 8, 2014, 6:56 p.m.)
 
 
  Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
  Repository: aurora
 
 
  Description
  ---
 
  The health check disabler allows health checks for a job to be snoozed
 temporarily by touching a snooze file in the job's sandbox.  The
 appropriate unit tests were modified/added.
 
  The corresponding JIRA ticket is the following:
  https://issues.apache.org/jira/browse/AURORA-795
 
 
  Diffs
  -
 
src/main/python/apache/aurora/executor/common/health_checker.py
 4980411c847d12655cbb363404707ebd9f0bd163
src/test/python/apache/aurora/executor/common/BUILD
 c7f7a003c865d479ba6e3cd7b5349322f884f653
src/test/python/apache/aurora/executor/common/test_health_checker.py
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec
 
  Diff: https://reviews.apache.org/r/26383/diff/
 
 
  Testing
  ---
 
  On vagrant in ~/aurora, I ran
  ./pants src/test/python/apache/aurora/executor::
 
 
  Thanks,
 
  David Pan
 
 




Re: Review Request 26383: Health Check Disabler

2014-10-09 Thread Joshua Cohen


 On Oct. 9, 2014, 2:53 p.m., Bill Farner wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 66
  https://reviews.apache.org/r/26383/diff/3/?file=716355#file716355line66
 
  FWIW i actually meant to suggest that the snooze has no concept of time 
  at all.  If the file is there, don't perform health checks.  When you want 
  to re-enable health checks, delete the file.  Happy to hear what others 
  think about that.
 
 Zameer Manji wrote:
 Doesn't this mean user error can disable health checks forever? I think 
 we should treat disabling health checking as an exceptional state (since the 
 proceses has opted in to health checking) and therefore require user action 
 (increasing mtime) to continue to stay in this state.
 
 Kevin Sweeney wrote:
 Presumably we can trust the user here - health checking is after all 
 opt-in via {{thermos.ports[health]}}. Making it a simple on/off switch 
 greatly simplifies the code on our end.
 
 Zameer Manji wrote:
 I'm very worried about introducing a feature which can allow unhealthy 
 instances live forever. Furthermore this information isn't exposed on the 
 Aurora UI or Observer UI so there isn't an easy way to check if an instance 
 has health checking disabled or not. A user can be completely oblivious that 
 health checking of their instance has been disabled.
 
 David Pan wrote:
 The motivation behind the health check disabler is for another project I 
 am working on, which is the JVM Heap Dumper.  Long story short, the JVM Heap 
 Dumper needs to disable health checks for the task that is being heap dumped. 
  Under normal circumstances, the heap dumper can re-enable the health checks. 
  However, if the heap dumper dies unexpectedly, the health checks will remain 
 disabled forever if we don't implement time control.  We don't want the 
 disabling and enabling of health checks to be a manual process.
 
 Kevin Sweeney wrote:
 I still agree with Bill here - the simpler implementation (skip health 
 check if the file exists) is easier to reason about and a cleaner API.
 
 A process running in the sandbox could implement some form of timeout 
 logic itself (for example, the application could expose an endpoint that 
 forks the equivalent of touch .healthchecksnooze; sleep 60; rm 
 .healthchecksnooze) but I like the idea of keeping the executor API here 
 simple to explain and reason about.
 
 David Pan wrote:
 Overall, wouldn't it be simpler for the executor API to handle the time 
 control, rather than requiring an identical process in every task's sandbox 
 to handle the time control?

I feel like we should err on the side of correctness here, rather than 
simplicity? The dangers of someone accidentally leaving health checks disabled 
indefinitely (on a service that has opted in to health checks) are not 
insignificant.


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55985
---


On Oct. 9, 2014, 1:56 a.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 9, 2014, 1:56 a.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread Bill Farner


 On Oct. 6, 2014, 10:55 p.m., Joshua Cohen wrote:
  src/main/python/apache/aurora/config/schema/base.py, line 69
  https://reviews.apache.org/r/26383/diff/1/?file=714259#file714259line69
 
  Do we need to make this path configurable? I'm trying to think of a 
  reason that someone might want to change it, and I'm coming up blank, 
  however I could envision a scenario where someone does something malicious 
  (intentionally or not) like sets this path to something that shouldn't be 
  removed (e.g. the artifact being run).

+1, knob uneeded


- Bill


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55590
---


On Oct. 6, 2014, 9:24 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 6, 2014, 9:24 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The path of the 
 snooze file and the snooze duration can be set in the HealthCheckConfig.  The 
 appropriate unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/configuration-reference.md 5166d45ddf95ae5d8afe39dd3b00654ac91857ec 
   docs/configuration-tutorial.md 67998e9dab6ac429d96d7c0d2df959336b767f32 
   src/main/python/apache/aurora/config/schema/base.py 
 f12634f103c3eb20e43f37c25d9b0fc3e3d228ec 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread Bill Farner


 On Oct. 6, 2014, 10:40 p.m., Brian Wickman wrote:
  docs/configuration-reference.md, lines 359-360
  https://reviews.apache.org/r/26383/diff/1/?file=714257#file714257line359
 
  Is there any reason this needs to be configurable?  Why not just 
  hardcode the filename as '.healthchecksnooze' and then allow the user to 
  specify the snooze at runtime, e.g. echo '600'  .healthchecksnooze to 
  sleep for 600 seconds.  (And if the value is malformed, just unlink and 
  don't snooze.)
 
 David Pan wrote:
 A few questions:
 1. In the case if the value is malformed, do we want to force fail the 
 health check in order to alert the user the fact that they failed at setting 
 the snooze duration.  Otherwise, the user might not notice the mistake.
 2. Should we read the value from the file only the first time the snooze 
 file is created.  Or, should we allow the user change the value on the fly.
 
 Brian Wickman wrote:
 1. I think force failing the health check would be counterproductive -- 
 if you accidentally fat finger, you don't want to kill the task which might 
 already be a unicorn.  Instead just unlinking right away or perhaps using a 
 default snooze value.
 
 2. Maybe do mtime + time in the file as the snooze expiration, and 
 re-read if the mtime changed.  This means at least doing a stat() each health 
 check interval, but will allow you to change the snooze on the fly.

How about removing the time control altogether, and let presence of the file 
serve as the snooze?  It's easier to add the time control later than to remove 
it if we decide it's unneeded.  This allows us to sidestep the malformed file 
content discussion.


- Bill


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55592
---


On Oct. 6, 2014, 9:24 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 6, 2014, 9:24 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The path of the 
 snooze file and the snooze duration can be set in the HealthCheckConfig.  The 
 appropriate unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/configuration-reference.md 5166d45ddf95ae5d8afe39dd3b00654ac91857ec 
   docs/configuration-tutorial.md 67998e9dab6ac429d96d7c0d2df959336b767f32 
   src/main/python/apache/aurora/config/schema/base.py 
 f12634f103c3eb20e43f37c25d9b0fc3e3d228ec 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread Joshua Cohen
I'm +1 on removing the time control as well. If you need to extend the
snooze you could always touch -m the snooze file?

On Wed, Oct 8, 2014 at 9:51 AM, Bill Farner wfar...@apache.org wrote:



  On Oct. 6, 2014, 10:40 p.m., Brian Wickman wrote:
   docs/configuration-reference.md, lines 359-360
   
 https://reviews.apache.org/r/26383/diff/1/?file=714257#file714257line359
  
   Is there any reason this needs to be configurable?  Why not just
 hardcode the filename as '.healthchecksnooze' and then allow the user to
 specify the snooze at runtime, e.g. echo '600'  .healthchecksnooze to
 sleep for 600 seconds.  (And if the value is malformed, just unlink and
 don't snooze.)
 
  David Pan wrote:
  A few questions:
  1. In the case if the value is malformed, do we want to force fail
 the health check in order to alert the user the fact that they failed at
 setting the snooze duration.  Otherwise, the user might not notice the
 mistake.
  2. Should we read the value from the file only the first time the
 snooze file is created.  Or, should we allow the user change the value on
 the fly.
 
  Brian Wickman wrote:
  1. I think force failing the health check would be counterproductive
 -- if you accidentally fat finger, you don't want to kill the task which
 might already be a unicorn.  Instead just unlinking right away or perhaps
 using a default snooze value.
 
  2. Maybe do mtime + time in the file as the snooze expiration, and
 re-read if the mtime changed.  This means at least doing a stat() each
 health check interval, but will allow you to change the snooze on the fly.

 How about removing the time control altogether, and let presence of the
 file serve as the snooze?  It's easier to add the time control later than
 to remove it if we decide it's unneeded.  This allows us to sidestep the
 malformed file content discussion.


 - Bill


 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/#review55592
 ---


 On Oct. 6, 2014, 9:24 p.m., David Pan wrote:
 
  ---
  This is an automatically generated e-mail. To reply, visit:
  https://reviews.apache.org/r/26383/
  ---
 
  (Updated Oct. 6, 2014, 9:24 p.m.)
 
 
  Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
  Repository: aurora
 
 
  Description
  ---
 
  The health check disabler allows health checks for a job to be snoozed
 temporarily by touching a snooze file in the job's sandbox.  The path of
 the snooze file and the snooze duration can be set in the
 HealthCheckConfig.  The appropriate unit tests were modified/added.
 
  The corresponding JIRA ticket is the following:
  https://issues.apache.org/jira/browse/AURORA-795
 
 
  Diffs
  -
 
docs/configuration-reference.md
 5166d45ddf95ae5d8afe39dd3b00654ac91857ec
docs/configuration-tutorial.md
 67998e9dab6ac429d96d7c0d2df959336b767f32
src/main/python/apache/aurora/config/schema/base.py
 f12634f103c3eb20e43f37c25d9b0fc3e3d228ec
src/main/python/apache/aurora/executor/common/health_checker.py
 4980411c847d12655cbb363404707ebd9f0bd163
src/test/python/apache/aurora/executor/common/BUILD
 c7f7a003c865d479ba6e3cd7b5349322f884f653
src/test/python/apache/aurora/executor/common/test_health_checker.py
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec
 
  Diff: https://reviews.apache.org/r/26383/diff/
 
 
  Testing
  ---
 
  On vagrant in ~/aurora, I ran
  ./pants src/test/python/apache/aurora/executor::
 
 
  Thanks,
 
  David Pan
 
 




Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread David Pan


 On Oct. 6, 2014, 10:55 p.m., Joshua Cohen wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, lines 
  157-158
  https://reviews.apache.org/r/26383/diff/1/?file=714260#file714260line157
 
  I know you didn't originate this pattern, but it seems odd for these 
  defaults to be repeated on both the HealthChecker and the 
  ThreadedHealthChecker. Perhaps they should be constants, or maybe 
  ThreadedHealthChecker shouldn't have defaults at all?

I actually did originate this pattern.  The reason why ThreadedHealthChecker 
has defaults is because some tests don't care about all the parameters to 
ThreadedHealthChecker.  For example, if my test doesn't care about the 
interval_secs, the defaults allow me to ignore that parameter altogether, and 
focus on the parameters that I actually care about.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55590
---


On Oct. 6, 2014, 9:24 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 6, 2014, 9:24 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The path of the 
 snooze file and the snooze duration can be set in the HealthCheckConfig.  The 
 appropriate unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/configuration-reference.md 5166d45ddf95ae5d8afe39dd3b00654ac91857ec 
   docs/configuration-tutorial.md 67998e9dab6ac429d96d7c0d2df959336b767f32 
   src/main/python/apache/aurora/config/schema/base.py 
 f12634f103c3eb20e43f37c25d9b0fc3e3d228ec 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 9, 2014, 12:43 a.m.)


Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Changes
---

Removed configuration for snooze file and snooze duration.  Removed time 
control.  Addressed other comments.


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The path of the 
snooze file and the snooze duration can be set in the HealthCheckConfig.  The 
appropriate unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs (updated)
-

  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 9, 2014, 12:46 a.m.)


Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Repository: aurora


Description (updated)
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs
-

  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55924
---



src/main/python/apache/aurora/executor/common/health_checker.py
https://reviews.apache.org/r/26383/#comment96295

Why add a default value here?



src/main/python/apache/aurora/executor/common/health_checker.py
https://reviews.apache.org/r/26383/#comment96296

Why add a default value here?



src/main/python/apache/aurora/executor/common/health_checker.py
https://reviews.apache.org/r/26383/#comment96297

Why add a default here?


- Zameer Manji


On Oct. 8, 2014, 5:46 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 8, 2014, 5:46 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread Zameer Manji


 On Oct. 8, 2014, 6:09 p.m., Zameer Manji wrote:
  src/main/python/apache/aurora/executor/common/health_checker.py, line 41
  https://reviews.apache.org/r/26383/diff/2/?file=716291#file716291line41
 
  Why add a default value here?
 
 David Pan wrote:
 The reason why ThreadedHealthChecker has defaults is because some tests 
 for ThreadedHealthChecker don't care about all the parameters to 
 ThreadedHealthChecker.  For example, if my test doesn't care about the 
 interval_secs, the defaults allow me to ignore that parameter altogether, and 
 focus on the parameters that I actually care about.

It isn't acceptable to alter parameters like that. Please modify your tests to 
pass in a value instead.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55924
---


On Oct. 8, 2014, 5:46 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 8, 2014, 5:46 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The appropriate 
 unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Re: Review Request 26383: Health Check Disabler

2014-10-08 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

(Updated Oct. 9, 2014, 1:56 a.m.)


Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Changes
---

Removed default parameters from ThreadedHealthChecker


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The appropriate 
unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs (updated)
-

  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-07 Thread Brian Wickman


 On Oct. 6, 2014, 10:40 p.m., Brian Wickman wrote:
  docs/configuration-reference.md, lines 359-360
  https://reviews.apache.org/r/26383/diff/1/?file=714257#file714257line359
 
  Is there any reason this needs to be configurable?  Why not just 
  hardcode the filename as '.healthchecksnooze' and then allow the user to 
  specify the snooze at runtime, e.g. echo '600'  .healthchecksnooze to 
  sleep for 600 seconds.  (And if the value is malformed, just unlink and 
  don't snooze.)
 
 David Pan wrote:
 A few questions:
 1. In the case if the value is malformed, do we want to force fail the 
 health check in order to alert the user the fact that they failed at setting 
 the snooze duration.  Otherwise, the user might not notice the mistake.
 2. Should we read the value from the file only the first time the snooze 
 file is created.  Or, should we allow the user change the value on the fly.

1. I think force failing the health check would be counterproductive -- if you 
accidentally fat finger, you don't want to kill the task which might already be 
a unicorn.  Instead just unlinking right away or perhaps using a default snooze 
value.

2. Maybe do mtime + time in the file as the snooze expiration, and re-read if 
the mtime changed.  This means at least doing a stat() each health check 
interval, but will allow you to change the snooze on the fly.


- Brian


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55592
---


On Oct. 6, 2014, 9:24 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 6, 2014, 9:24 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The path of the 
 snooze file and the snooze duration can be set in the HealthCheckConfig.  The 
 appropriate unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/configuration-reference.md 5166d45ddf95ae5d8afe39dd3b00654ac91857ec 
   docs/configuration-tutorial.md 67998e9dab6ac429d96d7c0d2df959336b767f32 
   src/main/python/apache/aurora/config/schema/base.py 
 f12634f103c3eb20e43f37c25d9b0fc3e3d228ec 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan
 




Review Request 26383: Health Check Disabler

2014-10-06 Thread David Pan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/
---

Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.


Repository: aurora


Description
---

The health check disabler allows health checks for a job to be snoozed 
temporarily by touching a snooze file in the job's sandbox.  The path of the 
snooze file and the snooze duration can be set in the HealthCheckConfig.  The 
appropriate unit tests were modified/added.

The corresponding JIRA ticket is the following:
https://issues.apache.org/jira/browse/AURORA-795


Diffs
-

  docs/configuration-reference.md 5166d45ddf95ae5d8afe39dd3b00654ac91857ec 
  docs/configuration-tutorial.md 67998e9dab6ac429d96d7c0d2df959336b767f32 
  src/main/python/apache/aurora/config/schema/base.py 
f12634f103c3eb20e43f37c25d9b0fc3e3d228ec 
  src/main/python/apache/aurora/executor/common/health_checker.py 
4980411c847d12655cbb363404707ebd9f0bd163 
  src/test/python/apache/aurora/executor/common/BUILD 
c7f7a003c865d479ba6e3cd7b5349322f884f653 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
aa36415fa891fc523a3a376ffeca5d3cd5ceabec 

Diff: https://reviews.apache.org/r/26383/diff/


Testing
---

On vagrant in ~/aurora, I ran
./pants src/test/python/apache/aurora/executor::


Thanks,

David Pan



Re: Review Request 26383: Health Check Disabler

2014-10-06 Thread Brian Wickman

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26383/#review55592
---



docs/configuration-reference.md
https://reviews.apache.org/r/26383/#comment95956

Is there any reason this needs to be configurable?  Why not just hardcode 
the filename as '.healthchecksnooze' and then allow the user to specify the 
snooze at runtime, e.g. echo '600'  .healthchecksnooze to sleep for 600 
seconds.  (And if the value is malformed, just unlink and don't snooze.)


- Brian Wickman


On Oct. 6, 2014, 9:24 p.m., David Pan wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/26383/
 ---
 
 (Updated Oct. 6, 2014, 9:24 p.m.)
 
 
 Review request for Aurora, Joe Smith, Brian Wickman, and Zameer Manji.
 
 
 Repository: aurora
 
 
 Description
 ---
 
 The health check disabler allows health checks for a job to be snoozed 
 temporarily by touching a snooze file in the job's sandbox.  The path of the 
 snooze file and the snooze duration can be set in the HealthCheckConfig.  The 
 appropriate unit tests were modified/added.
 
 The corresponding JIRA ticket is the following:
 https://issues.apache.org/jira/browse/AURORA-795
 
 
 Diffs
 -
 
   docs/configuration-reference.md 5166d45ddf95ae5d8afe39dd3b00654ac91857ec 
   docs/configuration-tutorial.md 67998e9dab6ac429d96d7c0d2df959336b767f32 
   src/main/python/apache/aurora/config/schema/base.py 
 f12634f103c3eb20e43f37c25d9b0fc3e3d228ec 
   src/main/python/apache/aurora/executor/common/health_checker.py 
 4980411c847d12655cbb363404707ebd9f0bd163 
   src/test/python/apache/aurora/executor/common/BUILD 
 c7f7a003c865d479ba6e3cd7b5349322f884f653 
   src/test/python/apache/aurora/executor/common/test_health_checker.py 
 aa36415fa891fc523a3a376ffeca5d3cd5ceabec 
 
 Diff: https://reviews.apache.org/r/26383/diff/
 
 
 Testing
 ---
 
 On vagrant in ~/aurora, I ran
 ./pants src/test/python/apache/aurora/executor::
 
 
 Thanks,
 
 David Pan