Re: Review Request 67639: Export count-down to forceful Maintenance as a metric.

2018-06-18 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67639/#review204974
---



Master (9c9b592) is green with this patch.
  ./build-support/jenkins/build.sh

However, it appears that it might lack test coverage.

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On June 19, 2018, 1:21 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67639/
> ---
> 
> (Updated June 19, 2018, 1:21 a.m.)
> 
> 
> Review request for Aurora, Franck Cuny and Jordan Ly.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Since the scheduler enforces a maximum timeout on each
> maintenance request and we now allow CoordinatorSlaPolicy
> to block maintenance, we need to know which tasks are
> running into the forced maintenance timeout. Export the maintenance
> countdown time as a metric broken down by task keys.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/InstanceKeys.java b12ac83168401c15fb1d30179ea8e4816f09cd3d
>   src/main/java/org/apache/aurora/scheduler/maintenance/MaintenanceController.java 7fc5990dfb04c5528a44142c3efdd6d60d08188d
> 
> 
> Diff: https://reviews.apache.org/r/67639/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> **Tested in Vagrant**
> sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
>  100.0%
> maintenance_countdown_ms_vagrant/test/coordinator/0 264523
> maintenance_countdown_ms_vagrant/test/coordinator/1 24476
> sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
>  100.0%
> maintenance_countdown_ms_vagrant/test/coordinator/0 264523
> maintenance_countdown_ms_vagrant/test/coordinator/1 24476
> sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
>  100.0%
> maintenance_countdown_ms_vagrant/test/coordinator/0 264523
> maintenance_countdown_ms_vagrant/test/coordinator/1 0
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Review Request 67639: Export count-down to forceful Maintenance as a metric.

2018-06-18 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67639/
---

Review request for Aurora, Franck Cuny and Jordan Ly.


Repository: aurora


Description
---

Since the scheduler enforces a maximum timeout on each
maintenance request and we now allow CoordinatorSlaPolicy
to block maintenance, we need to know which tasks are
running into the forced maintenance timeout. Export the maintenance
countdown time as a metric broken down by task keys.
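
For readers following along, here is a minimal sketch of how such a per-task countdown could be derived. It is not the patch itself (the actual change is in Java, in `MaintenanceController.java`). The names `InstanceKey`, `countdown_metric_name`, and `countdown_ms` are hypothetical, and clamping the value at zero is an assumption based only on the `0` reading in the Vagrant output.

```python
from typing import NamedTuple


class InstanceKey(NamedTuple):
    """Hypothetical stand-in for a task's role/env/job/instance key."""
    role: str
    env: str
    job: str
    instance: int


def countdown_metric_name(key: InstanceKey) -> str:
    # Mirrors the metric names seen in the Vagrant output, e.g.
    # maintenance_countdown_ms_vagrant/test/coordinator/0
    return "maintenance_countdown_ms_%s/%s/%s/%d" % (
        key.role, key.env, key.job, key.instance)


def countdown_ms(requested_at_ms: int, timeout_ms: int, now_ms: int) -> int:
    # Time left until the maintenance request is forced.
    # Assumption: the value is clamped at zero once the timeout has passed.
    return max(0, requested_at_ms + timeout_ms - now_ms)
```

A task whose countdown stays near zero is one that the SLA policy kept blocking until the timeout forced the maintenance through.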


Diffs
-

  src/main/java/org/apache/aurora/scheduler/base/InstanceKeys.java b12ac83168401c15fb1d30179ea8e4816f09cd3d
  src/main/java/org/apache/aurora/scheduler/maintenance/MaintenanceController.java 7fc5990dfb04c5528a44142c3efdd6d60d08188d


Diff: https://reviews.apache.org/r/67639/diff/1/


Testing
---

./gradlew test

**Tested in Vagrant**
sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
 100.0%
maintenance_countdown_ms_vagrant/test/coordinator/0 264523
maintenance_countdown_ms_vagrant/test/coordinator/1 24476
sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
 100.0%
maintenance_countdown_ms_vagrant/test/coordinator/0 264523
maintenance_countdown_ms_vagrant/test/coordinator/1 24476
sshanmugham::tw-mbp-sshanmugham {~}$ curl http://192.168.33.7:8081/vars | grep maintenance_countdown
 100.0%
maintenance_countdown_ms_vagrant/test/coordinator/0 264523
maintenance_countdown_ms_vagrant/test/coordinator/1 0
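
The manual `curl | grep` check above could also be scripted. The following is a rough sketch that assumes `/vars` returns plain `name value` lines, as the pasted output suggests. `parse_countdowns` is a hypothetical helper, not part of Aurora.

```python
def parse_countdowns(vars_text, prefix="maintenance_countdown_ms_"):
    """Return {task key: remaining ms} for every countdown line in vars_text."""
    result = {}
    for line in vars_text.splitlines():
        parts = line.split()
        # Skip anything that is not a "name value" pair for our metric,
        # such as the curl progress line (" 100.0%").
        if len(parts) != 2 or not parts[0].startswith(prefix):
            continue
        result[parts[0][len(prefix):]] = int(parts[1])
    return result


# Sample text copied from the last Vagrant run above.
SAMPLE = (
    " 100.0%\n"
    "maintenance_countdown_ms_vagrant/test/coordinator/0 264523\n"
    "maintenance_countdown_ms_vagrant/test/coordinator/1 0\n"
)
COUNTDOWNS = parse_countdowns(SAMPLE)
```

Filtering `COUNTDOWNS` for values of `0` would list the instances that have hit the forced maintenance timeout.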


Thanks,

Santhosh Kumar Shanmugham



Re: Review Request 67638: Export number of tasks lost per dedicated role.

2018-06-18 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67638/#review204966
---


Ship it!




Master (4719fa7) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On June 18, 2018, 11:20 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67638/
> ---
> 
> (Updated June 18, 2018, 11:20 p.m.)
> 
> 
> Review request for Aurora, Franck Cuny and Jordan Ly.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Export number of tasks lost per dedicated role.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/TaskVars.java ee20ed3ad7c17bd4ca11239a467113fc8a9e8f00
>   src/test/java/org/apache/aurora/scheduler/TaskVarsTest.java 6321ec068bd16737f96b39d5fdd8db25f3dea15c
> 
> 
> Diff: https://reviews.apache.org/r/67638/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> **Tested on Vagrant**
> tasks_lost_dedicatedweb.multi 0
> tasks_lost_dedicated_vagrant 2
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 67638: Export number of tasks lost per dedicated role.

2018-06-18 Thread Jordan Ly

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67638/#review204965
---


Ship it!




Ship It!

- Jordan Ly


On June 18, 2018, 11:20 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67638/
> ---
> 
> (Updated June 18, 2018, 11:20 p.m.)
> 
> 
> Review request for Aurora, Franck Cuny and Jordan Ly.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Export number of tasks lost per dedicated role.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/TaskVars.java ee20ed3ad7c17bd4ca11239a467113fc8a9e8f00
>   src/test/java/org/apache/aurora/scheduler/TaskVarsTest.java 6321ec068bd16737f96b39d5fdd8db25f3dea15c
> 
> 
> Diff: https://reviews.apache.org/r/67638/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> **Tested on Vagrant**
> tasks_lost_dedicatedweb.multi 0
> tasks_lost_dedicated_vagrant 2
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 67638: Export number of tasks lost per dedicated role.

2018-06-18 Thread Franck Cuny via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67638/#review204964
---


Ship it!




Ship It!

- Franck Cuny


On June 18, 2018, 11:20 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67638/
> ---
> 
> (Updated June 18, 2018, 11:20 p.m.)
> 
> 
> Review request for Aurora, Franck Cuny and Jordan Ly.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Export number of tasks lost per dedicated role.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/TaskVars.java ee20ed3ad7c17bd4ca11239a467113fc8a9e8f00
>   src/test/java/org/apache/aurora/scheduler/TaskVarsTest.java 6321ec068bd16737f96b39d5fdd8db25f3dea15c
> 
> 
> Diff: https://reviews.apache.org/r/67638/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> **Tested on Vagrant**
> tasks_lost_dedicatedweb.multi 0
> tasks_lost_dedicated_vagrant 2
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Review Request 67638: Export number of tasks lost per dedicated role.

2018-06-18 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67638/
---

Review request for Aurora, Franck Cuny and Jordan Ly.


Repository: aurora


Description
---

Export number of tasks lost per dedicated role.
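
As an illustration only, the per-role counting could look roughly like the sketch below. The real change is in Java (`TaskVars.java`). `LostTaskCounters` is a hypothetical class, and the sketch assumes the dedicated attribute value has the form `role/name`, so the role is the part before the first slash.

```python
from collections import Counter


class LostTaskCounters:
    """Hypothetical per-role counter of LOST tasks on dedicated hosts."""

    def __init__(self):
        self._counts = Counter()

    def record_lost(self, dedicated_value):
        # Assumption: dedicated values look like "role/name".
        role = dedicated_value.split("/", 1)[0]
        self._counts[role] += 1

    def export(self):
        # One metric per role, named like the Vagrant output,
        # e.g. tasks_lost_dedicated_vagrant.
        return {
            "tasks_lost_dedicated_%s" % role: count
            for role, count in sorted(self._counts.items())
        }
```

Keying the counter by role rather than by host keeps the number of exported metrics bounded by the number of dedicated roles.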


Diffs
-

  src/main/java/org/apache/aurora/scheduler/TaskVars.java ee20ed3ad7c17bd4ca11239a467113fc8a9e8f00
  src/test/java/org/apache/aurora/scheduler/TaskVarsTest.java 6321ec068bd16737f96b39d5fdd8db25f3dea15c


Diff: https://reviews.apache.org/r/67638/diff/1/


Testing
---

./gradlew test

**Tested on Vagrant**
tasks_lost_dedicatedweb.multi 0
tasks_lost_dedicated_vagrant 2


Thanks,

Santhosh Kumar Shanmugham



Re: Review Request 67627: Add observer flag to disable resource metric collection

2018-06-18 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67627/#review204936
---



Mostly LGTM.

Will the UI show 0s or empty spaces?

Can you expand on why PID namespaces break metrics?


docs/reference/observer-configuration.md
Lines 27 (patched)


also disk metrics



src/main/python/apache/aurora/tools/thermos_observer.py
Lines 68 (patched)


also disk metrics


- Santhosh Kumar Shanmugham


On June 18, 2018, 1:57 a.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67627/
> ---
> 
> (Updated June 18, 2018, 1:57 a.m.)
> 
> 
> Review request for Aurora, Renan DelValle, Reza Motamedi, and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Add observer command line option `--disable_task_resource_collection` to
> disable the collection of CPU, memory, and disk metrics for observed tasks.
> This is useful in setups where metrics cannot be gathered reliably (e.g. when
> using PID namespaces) or when it is expensive due to hundreds of active tasks
> per host.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md edc081f502370190597ad028f3275cdfd572f5ca
>   docs/reference/observer-configuration.md c791b3480e5bf35e6eb0fbea908ff3242eab315d
>   src/main/python/apache/aurora/config/BUILD 12e7fe973f456d0847ce63d3b293131a7f4c3bdd
>   src/main/python/apache/aurora/tools/thermos_observer.py fd9465d2e2b3135f3fdf8230777117adaa89337c
>   src/main/python/apache/thermos/monitoring/resource.py 72ed4e5a82dfd8a09e0a8262f6da4992ac98542a
>   src/main/python/apache/thermos/observer/task_observer.py 94cd6c541bb7f8a4c153cc51caa63d2c0a49
>   src/test/python/apache/thermos/monitoring/test_resource.py 44450647a180f86903ebd37f2a9f4327496597e9
> 
> 
> Diff: https://reviews.apache.org/r/67627/diff/1/
> 
> 
> Testing
> ---
> 
> We are running our Mesos agents with PID namespaces enabled (i.e.
> `--isolation='namespaces/ipc,namespaces/pid,...'`). Sometimes the hosts are
> also tightly packed with many small tasks (e.g. `~130` active tasks and `~1000`
> finished tasks). Even with very relaxed scrape settings of
> `--task_process_collection_interval_secs=3000` and
> `--task_disk_collection_interval_secs=3000`, it can take between `150ms-2500ms`
> to render the observer landing page `/main`. This patch reduces this to about
> `100ms-150ms`. There is no immediate downside as metrics reporting is broken
> anyway due to the PID namespacing.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>
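
To make the proposed switch concrete, here is a small sketch of a boolean flag that gates per-task resource sampling. It uses the standard-library `argparse` as a stand-in and does not reproduce how `thermos_observer.py` actually registers its options. `build_parser` and `sample_resources` are hypothetical names, and returning an empty mapping when collection is disabled is an assumption, not a statement about what the observer UI shows.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="thermos_observer_sketch")
    parser.add_argument(
        "--disable_task_resource_collection",
        action="store_true",
        default=False,
        help="Skip collection of CPU, memory, and disk metrics for observed tasks.")
    return parser


def sample_resources(task_ids, disabled, collect):
    """Call collect(task_id) for each task unless collection is disabled."""
    if disabled:
        # Assumption: with the flag set, no sampling work happens at all.
        return {}
    return {task_id: collect(task_id) for task_id in task_ids}
```

Because the guard sits in front of the per-task call, the cost with the flag set no longer grows with the number of active tasks on the host.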



Re: Review Request 67627: Add observer flag to disable resource metric collection

2018-06-18 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67627/#review204916
---


Ship it!




Master (4719fa7) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On June 18, 2018, 8:57 a.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67627/
> ---
> 
> (Updated June 18, 2018, 8:57 a.m.)
> 
> 
> Review request for Aurora, Renan DelValle, Reza Motamedi, and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Add observer command line option `--disable_task_resource_collection` to
> disable the collection of CPU, memory, and disk metrics for observed tasks.
> This is useful in setups where metrics cannot be gathered reliably (e.g. when
> using PID namespaces) or when it is expensive due to hundreds of active tasks
> per host.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md edc081f502370190597ad028f3275cdfd572f5ca
>   docs/reference/observer-configuration.md c791b3480e5bf35e6eb0fbea908ff3242eab315d
>   src/main/python/apache/aurora/config/BUILD 12e7fe973f456d0847ce63d3b293131a7f4c3bdd
>   src/main/python/apache/aurora/tools/thermos_observer.py fd9465d2e2b3135f3fdf8230777117adaa89337c
>   src/main/python/apache/thermos/monitoring/resource.py 72ed4e5a82dfd8a09e0a8262f6da4992ac98542a
>   src/main/python/apache/thermos/observer/task_observer.py 94cd6c541bb7f8a4c153cc51caa63d2c0a49
>   src/test/python/apache/thermos/monitoring/test_resource.py 44450647a180f86903ebd37f2a9f4327496597e9
> 
> 
> Diff: https://reviews.apache.org/r/67627/diff/1/
> 
> 
> Testing
> ---
> 
> We are running our Mesos agents with PID namespaces enabled (i.e.
> `--isolation='namespaces/ipc,namespaces/pid,...'`). Sometimes the hosts are
> also tightly packed with many small tasks (e.g. `~130` active tasks and `~1000`
> finished tasks). Even with very relaxed scrape settings of
> `--task_process_collection_interval_secs=3000` and
> `--task_disk_collection_interval_secs=3000`, it can take between `150ms-2500ms`
> to render the observer landing page `/main`. This patch reduces this to about
> `100ms-150ms`. There is no immediate downside as metrics reporting is broken
> anyway due to the PID namespacing.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>