Re: Review Request 66103: Introduce mesos disk collector

2018-03-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199521
---



This patch does not apply cleanly against master (b3fa9fe), do you need to 
rebase?

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On March 20, 2018, 5:37 a.m., Reza Motamedi wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> ---
> 
> (Updated March 20, 2018, 5:37 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
> Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage 
> for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
> essentially repeating the work already done by the agent. In practice, we see 
> that disk monitoring is one of the most expensive resource monitoring tasks. 
> For instance, when there are deeply nested directories, the CPU utilization 
> of the observer process can easily reach 1.5 CPUs. It would be ideal if we 
> delegate the disk monitoring task to the agent and do it only once. With this 
> approach, when disk collection has improved in the agent (for instance by 
> implementing XFS isolation), we can simply benefit from it without any code 
> change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch that introduces `MesosDiskCollector` which queries the agent's API 
> endpoint to lookup disk_used_bytes. Note that there is also resource 
> monitoring in thermos executor. Currently, I left the disk collector there to 
> use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and 
> `etc_mesos-slave/isolation` for testing. They can be left as is. I included 
> them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 
> 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/aurora-executor.service 
> 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   src/main/python/apache/aurora/tools/thermos_observer.py 
> dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 
> 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   
> src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
>  fe74bd1d3ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 
> 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 
> 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py 
> e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/2/
> 
> 
> Testing
> ---
> 
> I added unit tests.
> Tested in vagrant and it works as intenced.
> I also built and deployed in our test enviroment. In order to measure 
> imporoved performance I created jobs with nested folders and noticed 
> reduction in CPU utilization of the Observer process, by at least 60%. (1.5 
> CPU cores to 0.4 CPU cores)
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>



Re: Review Request 66103: Introduce mesos disk collector

2018-03-19 Thread Reza Motamedi


> On March 19, 2018, 9 p.m., Santhosh Kumar Shanmugham wrote:
> > 3rdparty/python/requirements.txt
> > Lines 23 (patched)
> > 
> >
> > Any reason not using the more widely used `jq`?

There are two python libraries for jq 
1) https://pypi.python.org/pypi/jq
2) https://pypi.python.org/pypi/pyjq

These two libs have not be updataed recently. We also don't need all the 
functions of jq. I am open to suggestions. If any of the above or another lib 
is preferred.


> On March 19, 2018, 9 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/thermos/monitoring/resource.py
> > Line 158 (original), 159 (patched)
> > 
> >
> > Call this `disk_collector_class`? It reads a little wierd when we call 
> > this `disk_collector`, meaning it is the actual object to be used.

I agree. Addressed.


> On March 19, 2018, 9 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/thermos/monitoring/resource.py
> > Lines 164 (patched)
> > 
> >
> > Can we combine this via partial function to the `disk_collector_class` 
> > argument? This will keep the constructor more idiomatic.

I added `DiskCollectorProvider` to address this.


- Reza


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199448
---


On March 20, 2018, 5:37 a.m., Reza Motamedi wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> ---
> 
> (Updated March 20, 2018, 5:37 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
> Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage 
> for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
> essentially repeating the work already done by the agent. In practice, we see 
> that disk monitoring is one of the most expensive resource monitoring tasks. 
> For instance, when there are deeply nested directories, the CPU utilization 
> of the observer process can easily reach 1.5 CPUs. It would be ideal if we 
> delegate the disk monitoring task to the agent and do it only once. With this 
> approach, when disk collection has improved in the agent (for instance by 
> implementing XFS isolation), we can simply benefit from it without any code 
> change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch that introduces `MesosDiskCollector` which queries the agent's API 
> endpoint to lookup disk_used_bytes. Note that there is also resource 
> monitoring in thermos executor. Currently, I left the disk collector there to 
> use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and 
> `etc_mesos-slave/isolation` for testing. They can be left as is. I included 
> them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 
> 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/aurora-executor.service 
> 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   src/main/python/apache/aurora/tools/thermos_observer.py 
> dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 
> 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   
> src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
>  fe74bd1d3ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 
> 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 
> 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py 
> e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/2/
> 
> 
> Testing
> ---
> 
> I added unit tests.
> Tested in vagrant and it works as intenced.
> I also built and deployed in our 

Re: Review Request 66103: Introduce mesos disk collector

2018-03-19 Thread Reza Motamedi


> On March 19, 2018, 11:30 p.m., Kai Huang wrote:
> > src/main/python/apache/thermos/monitoring/disk.py
> > Lines 96 (patched)
> > 
> >
> > Just curious: is there any reason we set this value to -1GB?

It is just a magic number. Although it seems like the observer shows negative 
numbers as `-0.0GB` so I guess I can only set that to `-1`. The reason to show 
a negative number is to show the user that there is something worong with the 
setup.


- Reza


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199486
---


On March 20, 2018, 5:37 a.m., Reza Motamedi wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> ---
> 
> (Updated March 20, 2018, 5:37 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
> Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage 
> for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
> essentially repeating the work already done by the agent. In practice, we see 
> that disk monitoring is one of the most expensive resource monitoring tasks. 
> For instance, when there are deeply nested directories, the CPU utilization 
> of the observer process can easily reach 1.5 CPUs. It would be ideal if we 
> delegate the disk monitoring task to the agent and do it only once. With this 
> approach, when disk collection has improved in the agent (for instance by 
> implementing XFS isolation), we can simply benefit from it without any code 
> change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch that introduces `MesosDiskCollector` which queries the agent's API 
> endpoint to lookup disk_used_bytes. Note that there is also resource 
> monitoring in thermos executor. Currently, I left the disk collector there to 
> use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and 
> `etc_mesos-slave/isolation` for testing. They can be left as is. I included 
> them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 
> 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/aurora-executor.service 
> 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   src/main/python/apache/aurora/tools/thermos_observer.py 
> dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 
> 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   
> src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
>  fe74bd1d3ecd89fca1b5b2251202cbbc0f24 
>   src/test/python/apache/thermos/monitoring/BUILD 
> 8f2b39336dce6c7b580e6ba0009f60afdcb89179 
>   src/test/python/apache/thermos/monitoring/test_disk.py 
> 362393bfd1facf3198e2d438d0596b16700b72b8 
>   src/test/python/apache/thermos/monitoring/test_resource.py 
> e577e552d4ee1807096a15401851bb9fd95fa426 
> 
> 
> Diff: https://reviews.apache.org/r/66103/diff/2/
> 
> 
> Testing
> ---
> 
> I added unit tests.
> Tested in vagrant and it works as intenced.
> I also built and deployed in our test enviroment. In order to measure 
> imporoved performance I created jobs with nested folders and noticed 
> reduction in CPU utilization of the Observer process, by at least 60%. (1.5 
> CPU cores to 0.4 CPU cores)
> 
> 
> Thanks,
> 
> Reza Motamedi
> 
>



Re: Review Request 66103: Introduce mesos disk collector

2018-03-19 Thread Reza Motamedi

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
---

(Updated March 20, 2018, 5:37 a.m.)


Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
Santhosh Kumar Shanmugham, and Stephan Erb.


Changes
---

- Address feedback
- push `DISK_COLLECTION_INTERVAL` into `DiskCollectorSettings`
- Introduce `DiskCollectorProvider` to encapsulate the logic for selecting the 
right disk collection implementation.


Repository: aurora


Description
---

When disk isolation is enabled in a Mesos agent it calculates the disk usage 
for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
essentially repeating the work already done by the agent. In practice, we see 
that disk monitoring is one of the most expensive resource monitoring tasks. 
For instance, when there are deeply nested directories, the CPU utilization of 
the observer process can easily reach 1.5 CPUs. It would be ideal if we 
delegate the disk monitoring task to the agent and do it only once. With this 
approach, when disk collection has improved in the agent (for instance by 
implementing XFS isolation), we can simply benefit from it without any code 
change. Some more information about the problem is provided in AURORA-1918.

This patch that introduces `MesosDiskCollector` which queries the agent's API 
endpoint to lookup disk_used_bytes. Note that there is also resource monitoring 
in thermos executor. Currently, I left the disk collector there to use the `du` 
implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and 
`etc_mesos-slave/isolation` for testing. They can be left as is. I included 
them in this patch to show how this would work e2e.


Diffs (updated)
-

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 
1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/aurora-executor.service 
5a1a9082ecd7b1367ec677d760a5c375b6db9076 
  src/main/python/apache/aurora/tools/thermos_observer.py 
dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 
65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 
52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
  src/main/python/apache/thermos/monitoring/resource.py 
f5e3849ca6682c6d4720698be869ca6b9f703b94 
  src/main/python/apache/thermos/observer/task_observer.py 
4bb5d239e81fe4659397f899760c0e8853e93786 
  
src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
 fe74bd1d3ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 
8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 
362393bfd1facf3198e2d438d0596b16700b72b8 
  src/test/python/apache/thermos/monitoring/test_resource.py 
e577e552d4ee1807096a15401851bb9fd95fa426 


Diff: https://reviews.apache.org/r/66103/diff/2/

Changes: https://reviews.apache.org/r/66103/diff/1-2/


Testing
---

I added unit tests.
Tested in vagrant and it works as intenced.
I also built and deployed in our test enviroment. In order to measure imporoved 
performance I created jobs with nested folders and noticed reduction in CPU 
utilization of the Observer process, by at least 60%. (1.5 CPU cores to 0.4 CPU 
cores)


Thanks,

Reza Motamedi



Re: Review Request 66154: Adding support for using custom executors via the Aurora DSL

2018-03-19 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66154/#review199504
---



Not sure how possible it is given the requirement for a completely custom 
executor, but e2e coverage would be great.

- David McLaughlin


On March 19, 2018, 11:33 p.m., Renan DelValle wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66154/
> ---
> 
> (Updated March 19, 2018, 11:33 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan 
> Erb.
> 
> 
> Bugs: AURORA-1981
> https://issues.apache.org/jira/browse/AURORA-1981
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Adding executor_config as a field to MesosJob in order to allow the use of 
> custom executors through the DSL.
> 
> First patch in a series to catch up on some work I have backlogged regarding 
> custom executors.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md b5d06c42badd3b623ad365a5d34a7162e8c782f2 
>   docs/features/custom-executors.md 1357c1ed334dbd90ac9937c557766ce31686c293 
>   docs/reference/configuration.md 725e073e8057f0e5e17f50285d86c89638c96d16 
>   src/main/python/apache/aurora/config/schema/base.py 
> 3d57d6a68b9553938c8ef8a15e6ac88d466f7c3c 
>   src/main/python/apache/aurora/config/thrift.py 
> dcabb03500c8e60d5585f0e65c74e163bd93e43a 
>   src/test/python/apache/aurora/client/cli/test_inspect.py 
> 8c624800eed383383dfb07c10aae103c54875728 
>   src/test/python/apache/aurora/config/test_thrift.py 
> 7bf050836633780401c5d8a371e006bde025c420 
> 
> 
> Diff: https://reviews.apache.org/r/66154/diff/1/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Renan DelValle
> 
>



Re: Review Request 66154: Adding support for using custom executors via the Aurora DSL

2018-03-19 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66154/#review199503
---


Ship it!




Ship It!

- David McLaughlin


On March 19, 2018, 11:33 p.m., Renan DelValle wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66154/
> ---
> 
> (Updated March 19, 2018, 11:33 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan 
> Erb.
> 
> 
> Bugs: AURORA-1981
> https://issues.apache.org/jira/browse/AURORA-1981
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Adding executor_config as a field to MesosJob in order to allow the use of 
> custom executors through the DSL.
> 
> First patch in a series to catch up on some work I have backlogged regarding 
> custom executors.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md b5d06c42badd3b623ad365a5d34a7162e8c782f2 
>   docs/features/custom-executors.md 1357c1ed334dbd90ac9937c557766ce31686c293 
>   docs/reference/configuration.md 725e073e8057f0e5e17f50285d86c89638c96d16 
>   src/main/python/apache/aurora/config/schema/base.py 
> 3d57d6a68b9553938c8ef8a15e6ac88d466f7c3c 
>   src/main/python/apache/aurora/config/thrift.py 
> dcabb03500c8e60d5585f0e65c74e163bd93e43a 
>   src/test/python/apache/aurora/client/cli/test_inspect.py 
> 8c624800eed383383dfb07c10aae103c54875728 
>   src/test/python/apache/aurora/config/test_thrift.py 
> 7bf050836633780401c5d8a371e006bde025c420 
> 
> 
> Diff: https://reviews.apache.org/r/66154/diff/1/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Renan DelValle
> 
>



Re: Review Request 66139: Speedup regular Thermos observer checkpoint refresh

2018-03-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66139/#review199501
---


Ship it!




Master (b3fa9fe) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On March 19, 2018, 11:22 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66139/
> ---
> 
> (Updated March 19, 2018, 11:22 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Renan DelValle, and Reza Motamedi.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Profiling indicated that roughly 70% of the refresh time was spend in 
> `os.path.realpath`. 
> This was introduced in https://reviews.apache.org/r/35580/ to properly handle 
> the `latest`
> symlink in the Mesos folder layout. 
> 
> This patch takes a slightly different approach to solve this problem based on 
> `os.path.islink`.
> The latter is faster as it just needs to look at a single folder rather than 
> an entire path.
> 
> To facilitate this optimization the patch removes unused functionality from 
> ExecutorDetector.
> Most probably this  was used by former GC executor that got removed a few 
> years ago.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/BUILD 
> 486230db34a22ea5dd0f68da911c0afb1afbcac0 
>   src/main/python/apache/aurora/executor/common/executor_detector.py 
> a07bfc34caa5f86153ace8184b061e253c39e92e 
>   src/main/python/apache/aurora/executor/common/path_detector.py 
> ed264d74ef5a5a7aa681a56b340f9b16504a88ad 
>   src/main/python/apache/thermos/monitoring/detector.py 
> 6e5a620e9f49e0960a814f4c1f701a21cc7678fd 
>   src/test/python/apache/aurora/executor/common/test_executor_detector.py 
> d2a948f2e95cbc1723b63462787c4256cc56731d 
>   src/test/python/apache/aurora/executor/common/test_path_detector.py 
> 7b5ef0cf552d22d4cfbf3357071de036551026dc 
> 
> 
> Diff: https://reviews.apache.org/r/66139/diff/1/
> 
> 
> Testing
> ---
> 
> I have tested this build on a node with 75 running tasks and 4500 finished 
> ones.
> 
> Before this patch:
> 
> D0319 09:05:07.219057 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.26s
> D0319 09:05:14.412610 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.39s
> D0319 09:05:21.600985 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.29s
>  
> With this patch:
> 
> D0319 09:00:24.084657 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.77s
> D0319 09:00:29.869699 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.78s
> D0319 09:00:35.643407 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.77s
> 
> The regular 2-3 second freezes when navigating the Thermos UI are now almost 
> gone for me.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Review Request 66154: Adding support for using custom executors via the Aurora DSL

2018-03-19 Thread Renan DelValle

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66154/
---

Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Stephan 
Erb.


Bugs: AURORA-1981
https://issues.apache.org/jira/browse/AURORA-1981


Repository: aurora


Description
---

Adding executor_config as a field to MesosJob in order to allow the use of 
custom executors through the DSL.

First patch in a series to catch up on some work I have backlogged regarding 
custom executors.


Diffs
-

  RELEASE-NOTES.md b5d06c42badd3b623ad365a5d34a7162e8c782f2 
  docs/features/custom-executors.md 1357c1ed334dbd90ac9937c557766ce31686c293 
  docs/reference/configuration.md 725e073e8057f0e5e17f50285d86c89638c96d16 
  src/main/python/apache/aurora/config/schema/base.py 
3d57d6a68b9553938c8ef8a15e6ac88d466f7c3c 
  src/main/python/apache/aurora/config/thrift.py 
dcabb03500c8e60d5585f0e65c74e163bd93e43a 
  src/test/python/apache/aurora/client/cli/test_inspect.py 
8c624800eed383383dfb07c10aae103c54875728 
  src/test/python/apache/aurora/config/test_thrift.py 
7bf050836633780401c5d8a371e006bde025c420 


Diff: https://reviews.apache.org/r/66154/diff/1/


Testing
---

./build-support/jenkins/build.sh

./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh


Thanks,

Renan DelValle



Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Reza Motamedi

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199491
---


Ship it!




Ship It!

- Reza Motamedi


On March 19, 2018, 2:55 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 2:55 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/main/python/apache/thermos/runner/thermos_runner.py 
> fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 
> 
> 
> Diff: https://reviews.apache.org/r/66136/diff/1/
> 
> 
> Testing
> ---
> 
> Manually verified that the observer debug logs still contain useful output.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 66139: Speedup regular Thermos observer checkpoint refresh

2018-03-19 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66139/#review199490
---



@ReviewBot retry

- Stephan Erb


On March 19, 2018, 5:24 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66139/
> ---
> 
> (Updated March 19, 2018, 5:24 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Profiling indicated that roughly 70% of the refresh time was spend in 
> `os.path.realpath`. 
> This was introduced in https://reviews.apache.org/r/35580/ to properly handle 
> the `latest`
> symlink in the Mesos folder layout. 
> 
> This patch takes a slightly different approach to solve this problem based on 
> `os.path.islink`.
> The latter is faster as it just needs to look at a single folder rather than 
> an entire path.
> 
> To facilitate this optimization the patch removes unused functionality from 
> ExecutorDetector.
> Most probably this  was used by former GC executor that got removed a few 
> years ago.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/BUILD 
> 486230db34a22ea5dd0f68da911c0afb1afbcac0 
>   src/main/python/apache/aurora/executor/common/executor_detector.py 
> a07bfc34caa5f86153ace8184b061e253c39e92e 
>   src/main/python/apache/aurora/executor/common/path_detector.py 
> ed264d74ef5a5a7aa681a56b340f9b16504a88ad 
>   src/main/python/apache/thermos/monitoring/detector.py 
> 6e5a620e9f49e0960a814f4c1f701a21cc7678fd 
>   src/test/python/apache/aurora/executor/common/test_executor_detector.py 
> d2a948f2e95cbc1723b63462787c4256cc56731d 
>   src/test/python/apache/aurora/executor/common/test_path_detector.py 
> 7b5ef0cf552d22d4cfbf3357071de036551026dc 
> 
> 
> Diff: https://reviews.apache.org/r/66139/diff/1/
> 
> 
> Testing
> ---
> 
> I have tested this build on a node with 75 running tasks and 4500 finished 
> ones.
> 
> Before this patch:
> 
> D0319 09:05:07.219057 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.26s
> D0319 09:05:14.412610 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.39s
> D0319 09:05:21.600985 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.29s
>  
> With this patch:
> 
> D0319 09:00:24.084657 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.77s
> D0319 09:00:29.869699 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.78s
> D0319 09:00:35.643407 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.77s
> 
> The regular 2-3 second freezes when navigating the Thermos UI are now almost 
> gone for me.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199488
---



Master (e0e90e1) is green with this patch.
  ./build-support/jenkins/build.sh

However, it appears that it might lack test coverage.

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On March 19, 2018, 7:55 a.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 7:55 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/main/python/apache/thermos/runner/thermos_runner.py 
> fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 
> 
> 
> Diff: https://reviews.apache.org/r/66136/diff/1/
> 
> 
> Testing
> ---
> 
> Manually verified that the observer debug logs still contain useful output.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Review Request 66152: Custom executor support for Aurora DSL

2018-03-19 Thread Renan DelValle

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66152/
---

Review request for Aurora.


Repository: aurora


Description
---

Adding support for launching Jobs that use a Custom Executor using the Aurora 
DSL.


Diffs
-

  RELEASE-NOTES.md b5d06c42badd3b623ad365a5d34a7162e8c782f2 
  docs/features/custom-executors.md 1357c1ed334dbd90ac9937c557766ce31686c293 
  docs/reference/configuration.md 725e073e8057f0e5e17f50285d86c89638c96d16 
  src/main/python/apache/aurora/config/schema/base.py 
3d57d6a68b9553938c8ef8a15e6ac88d466f7c3c 
  src/main/python/apache/aurora/config/thrift.py 
dcabb03500c8e60d5585f0e65c74e163bd93e43a 
  src/test/python/apache/aurora/client/cli/test_inspect.py 
8c624800eed383383dfb07c10aae103c54875728 
  src/test/python/apache/aurora/config/test_thrift.py 
7bf050836633780401c5d8a371e006bde025c420 


Diff: https://reviews.apache.org/r/66152/diff/1/


Testing
---


Thanks,

Renan DelValle



Re: Review Request 66103: Introduce mesos disk collector

2018-03-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199448
---



Approach looks good. Few comments on improving the interface and 
code-structuring.

Instead of plumbing the new arguments all the way through to 
`TaskResourceMonitor` can we build a partial factory method in `ThermoObserver` 
since we already have all the information at this point itself.


3rdparty/python/requirements.txt
Lines 23 (patched)


Any reason not using the more widely used `jq`?



examples/vagrant/systemd/aurora-executor.service
Lines 22-25 (patched)


Snake-case arguments like `log_to_disk`.



src/main/python/apache/thermos/monitoring/disk.py
Lines 153 (patched)


Calling it API_URL is misleading since HTTP endpoints have both the regular 
status, flags and metrics endpoints and the new HTTP API as well.

s/API_URL/AGENT_HTTP_ENDPOINT/

We have APIs under /api and that is not the ones we are calling here.



src/main/python/apache/thermos/monitoring/disk.py
Lines 154 (patched)


Can we use the `/containers` which should list all containers (AFAIK) and 
get rid of the `API_PATH` configuration parameter?



src/main/python/apache/thermos/monitoring/resource.py
Line 158 (original), 159 (patched)


Call this `disk_collector_class`? It reads a little wierd when we call this 
`disk_collector`, meaning it is the actual object to be used.



src/main/python/apache/thermos/monitoring/resource.py
Lines 164 (patched)


Can we combine this via partial function to the `disk_collector_class` 
argument? This will keep the constructor more idiomatic.


- Santhosh Kumar Shanmugham


On March 19, 2018, 7:58 a.m., Reza Motamedi wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> ---
> 
> (Updated March 19, 2018, 7:58 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
> Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage 
> for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
> essentially repeating the work already done by the agent. In practice, we see 
> that disk monitoring is one of the most expensive resource monitoring tasks. 
> For instance, when there are deeply nested directories, the CPU utilization 
> of the observer process can easily reach 1.5 CPUs. It would be ideal if we 
> delegate the disk monitoring task to the agent and do it only once. With this 
> approach, when disk collection has improved in the agent (for instance by 
> implementing XFS isolation), we can simply benefit from it without any code 
> change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch that introduces `MesosDiskCollector` which queries the agent's API 
> endpoint to lookup disk_used_bytes. Note that there is also resource 
> monitoring in thermos executor. Currently, I left the disk collector there to 
> use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and 
> `etc_mesos-slave/isolation` for testing. They can be left as is. I included 
> them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 
> 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/aurora-executor.service 
> 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   src/main/python/apache/aurora/tools/thermos_observer.py 
> dd9f0c46ceac9e939b1b763073314161de0ea614 
>   src/main/python/apache/thermos/monitoring/BUILD 
> 65ba7088f65e7baa5d30744736ba456b46a55e86 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   
> src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
>  fe74bd1d3ecd89fca1b5b2251202cbbc0f24 
>   

Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Jordan Ly

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199447
---


Ship it!




Ship It!

- Jordan Ly


On March 19, 2018, 2:55 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 2:55 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/main/python/apache/thermos/runner/thermos_runner.py 
> fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 
> 
> 
> Diff: https://reviews.apache.org/r/66136/diff/1/
> 
> 
> Testing
> ---
> 
> Manually verified that the observer debug logs still contain useful output.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199446
---


Ship it!




Ship It!

- Santhosh Kumar Shanmugham


On March 19, 2018, 7:55 a.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 7:55 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/main/python/apache/thermos/runner/thermos_runner.py 
> fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 
> 
> 
> Diff: https://reviews.apache.org/r/66136/diff/1/
> 
> 
> Testing
> ---
> 
> Manually verified that the observer debug logs still contain useful output.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Jordan Ly


> On March 19, 2018, 5:21 p.m., Aurora ReviewBot wrote:
> > Master (aaadad7) is red with this patch.
> >   ./build-support/jenkins/build.sh
> > 
> > Pass 2: Analyzing classes (319 / 332) - 96% complete
> > Pass 2: Analyzing classes (320 / 332) - 96% complete
> > Pass 2: Analyzing classes (321 / 332) - 96% complete
> > Pass 2: Analyzing classes (322 / 332) - 96% complete
> > Pass 2: Analyzing classes (323 / 332) - 97% complete
> > Pass 2: Analyzing classes (324 / 332) - 97% complete
> > Pass 2: Analyzing classes (325 / 332) - 97% complete
> > Pass 2: Analyzing classes (326 / 332) - 98% complete
> > Pass 2: Analyzing classes (327 / 332) - 98% complete
> > Pass 2: Analyzing classes (328 / 332) - 98% complete
> > Pass 2: Analyzing classes (329 / 332) - 99% complete
> > Pass 2: Analyzing classes (330 / 332) - 99% complete
> > Pass 2: Analyzing classes (331 / 332) - 99% complete
> > Pass 2: Analyzing classes (332 / 332) - 100% complete
> > Done with analysis
> > :test
> > 
> > org.apache.aurora.scheduler.events.WebhookTest > 
> > testTaskChangedWithOldStateError FAILED
> > java.lang.AssertionError at WebhookTest.java:251
> > I0319 17:20:50.604 [ShutdownHook, SchedulerMain] Stopping scheduler 
> > services. 
> > 
> > 1081 tests completed, 1 failed, 1 skipped
> > :test FAILED
> > :jacocoTestReport
> > Coverage report generated: 
> > file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
> > :jacocoTestCoverageVerification
> > 
> > FAILURE: Build failed with an exception.
> > 
> > * What went wrong:
> > Execution failed for task ':test'.
> > > There were failing tests. See the report at: 
> > > file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html
> > 
> > * Try:
> > Run with --stacktrace option to get the stack trace. Run with --info or 
> > --debug option to get more log output.
> > 
> > * Get more help at https://help.gradle.org
> > 
> > BUILD FAILED in 6m 56s
> > 45 actionable tasks: 36 executed, 9 up-to-date
> > 
> > 
> > I will refresh this build result if you post a review containing 
> > "@ReviewBot retry"

Very surprised to see this error popping up again. I cannot reproduce this 
locally, and after https://reviews.apache.org/r/64523/ it gives 60 seconds for 
`onThrowable` to complete which seems like it should be much more than enough. 
Perhaps the build box is running slowly?


- Jordan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199438
---


On March 19, 2018, 2:55 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 2:55 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/main/python/apache/thermos/runner/thermos_runner.py 
> fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 
> 
> 
> Diff: 

Re: Review Request 66139: Speedup regular Thermos observer checkpoint refresh

2018-03-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66139/#review199442
---



Master (aaadad7) is red with this patch.
  ./build-support/jenkins/build.sh

Pass 2: Analyzing classes (319 / 332) - 96% complete
Pass 2: Analyzing classes (320 / 332) - 96% complete
Pass 2: Analyzing classes (321 / 332) - 96% complete
Pass 2: Analyzing classes (322 / 332) - 96% complete
Pass 2: Analyzing classes (323 / 332) - 97% complete
Pass 2: Analyzing classes (324 / 332) - 97% complete
Pass 2: Analyzing classes (325 / 332) - 97% complete
Pass 2: Analyzing classes (326 / 332) - 98% complete
Pass 2: Analyzing classes (327 / 332) - 98% complete
Pass 2: Analyzing classes (328 / 332) - 98% complete
Pass 2: Analyzing classes (329 / 332) - 99% complete
Pass 2: Analyzing classes (330 / 332) - 99% complete
Pass 2: Analyzing classes (331 / 332) - 99% complete
Pass 2: Analyzing classes (332 / 332) - 100% complete
Done with analysis
:test

org.apache.aurora.scheduler.events.WebhookTest > 
testTaskChangedWithOldStateError FAILED
java.lang.AssertionError at WebhookTest.java:251
I0319 17:27:43.181 [ShutdownHook, SchedulerMain] Stopping scheduler services. 

1081 tests completed, 1 failed, 1 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: 
file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:jacocoTestCoverageVerification

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 6m 54s
45 actionable tasks: 36 executed, 9 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On March 19, 2018, 4:24 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66139/
> ---
> 
> (Updated March 19, 2018, 4:24 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Profiling indicated that roughly 70% of the refresh time was spend in 
> `os.path.realpath`. 
> This was introduced in https://reviews.apache.org/r/35580/ to properly handle 
> the `latest`
> symlink in the Mesos folder layout. 
> 
> This patch takes a slightly different approach to solve this problem based on 
> `os.path.islink`.
> The latter is faster as it just needs to look at a single folder rather than 
> an entire path.
> 
> To facilitate this optimization the patch removes unused functionality from 
> ExecutorDetector.
> Most probably this  was used by former GC executor that got removed a few 
> years ago.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/executor/BUILD 
> 486230db34a22ea5dd0f68da911c0afb1afbcac0 
>   src/main/python/apache/aurora/executor/common/executor_detector.py 
> a07bfc34caa5f86153ace8184b061e253c39e92e 
>   src/main/python/apache/aurora/executor/common/path_detector.py 
> ed264d74ef5a5a7aa681a56b340f9b16504a88ad 
>   src/main/python/apache/thermos/monitoring/detector.py 
> 6e5a620e9f49e0960a814f4c1f701a21cc7678fd 
>   src/test/python/apache/aurora/executor/common/test_executor_detector.py 
> d2a948f2e95cbc1723b63462787c4256cc56731d 
>   src/test/python/apache/aurora/executor/common/test_path_detector.py 
> 7b5ef0cf552d22d4cfbf3357071de036551026dc 
> 
> 
> Diff: https://reviews.apache.org/r/66139/diff/1/
> 
> 
> Testing
> ---
> 
> I have tested this build on a node with 75 running tasks and 4500 finished 
> ones.
> 
> Before this patch:
> 
> D0319 09:05:07.219057 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.26s
> D0319 09:05:14.412610 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.39s
> D0319 09:05:21.600985 14514 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 2.29s
>  
> With this patch:
> 
> D0319 09:00:24.084657 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.77s
> D0319 09:00:29.869699 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.78s
> D0319 09:00:35.643407 5427 task_observer.py:142] TaskObserver: finished 
> checkpoint refresh in 0.77s
> 
> The regular 2-3 second freezes when navigating the Thermos UI are now almost 
> gone for me.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199438
---



Master (aaadad7) is red with this patch.
  ./build-support/jenkins/build.sh

Pass 2: Analyzing classes (319 / 332) - 96% complete
Pass 2: Analyzing classes (320 / 332) - 96% complete
Pass 2: Analyzing classes (321 / 332) - 96% complete
Pass 2: Analyzing classes (322 / 332) - 96% complete
Pass 2: Analyzing classes (323 / 332) - 97% complete
Pass 2: Analyzing classes (324 / 332) - 97% complete
Pass 2: Analyzing classes (325 / 332) - 97% complete
Pass 2: Analyzing classes (326 / 332) - 98% complete
Pass 2: Analyzing classes (327 / 332) - 98% complete
Pass 2: Analyzing classes (328 / 332) - 98% complete
Pass 2: Analyzing classes (329 / 332) - 99% complete
Pass 2: Analyzing classes (330 / 332) - 99% complete
Pass 2: Analyzing classes (331 / 332) - 99% complete
Pass 2: Analyzing classes (332 / 332) - 100% complete
Done with analysis
:test

org.apache.aurora.scheduler.events.WebhookTest > 
testTaskChangedWithOldStateError FAILED
java.lang.AssertionError at WebhookTest.java:251
I0319 17:20:50.604 [ShutdownHook, SchedulerMain] Stopping scheduler services. 

1081 tests completed, 1 failed, 1 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: 
file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:jacocoTestCoverageVerification

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 6m 56s
45 actionable tasks: 36 executed, 9 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On March 19, 2018, 2:55 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 2:55 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/main/python/apache/thermos/runner/thermos_runner.py 
> fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 
> 
> 
> Diff: https://reviews.apache.org/r/66136/diff/1/
> 
> 
> Testing
> ---
> 
> Manually verified that the observer debug logs still contain useful output.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 65896: Persist scheduler/observer logs to /var/log/aurora/[FILE].log

2018-03-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65896/#review199434
---


Ship it!




Ship It!

- Santhosh Kumar Shanmugham


On March 14, 2018, 3:23 p.m., Jordan Ly wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65896/
> ---
> 
> (Updated March 14, 2018, 3:23 p.m.)
> 
> 
> Review request for Aurora, Renan DelValle and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Persist scheduler/observer logs to /var/log/aurora/[FILE].log
> 
> `journalctl -u aurora-[executor|scheduler]` still works.
> 
> 
> Diffs
> -
> 
>   docs/getting-started/vagrant.md 73d0affaa6cc9a31781dce54b4c05a1686d35b01 
>   examples/vagrant/aurorabuild.sh 5eb1822a7940abcf58903d5c60a1f004cf2439c9 
>   examples/vagrant/provision-dev-cluster.sh 
> 4a023907b57a6912936533c6d45a25fc929bbab8 
>   examples/vagrant/systemd/aurora-executor.service 
> 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   examples/vagrant/systemd/aurora-scheduler-kerberos.service 
> 1c2cfc4c8edd3f66675d843ef8a6518e2916bc63 
>   examples/vagrant/systemd/aurora-scheduler.service 
> 524cbb3b028b45157bcc082756f48a531c52ced3 
> 
> 
> Diff: https://reviews.apache.org/r/65896/diff/2/
> 
> 
> Testing
> ---
> 
> Manually tested that logs were created on vagrant.
> 
> ```
> vagrant@aurora:~$ ls /var/log/aurora/
> observer.log  scheduler.log
> vagrant@aurora:~$ tail -3 /var/log/aurora/scheduler.log
> Mar  5 00:44:13 aurora aurora-scheduler[3609]: I0305 00:44:13.441440  3722 
> scheduler.cpp:676] Enqueuing event HEARTBEAT received from 
> http://192.168.33.7:5050/master/api/v1/scheduler
> Mar  5 00:44:28 aurora aurora-scheduler[3609]: I0305 00:44:28.442432  3727 
> scheduler.cpp:676] Enqueuing event HEARTBEAT received from 
> http://192.168.33.7:5050/master/api/v1/scheduler
> Mar  5 00:44:43 aurora aurora-scheduler[3609]: I0305 00:44:43.444105  3726 
> scheduler.cpp:676] Enqueuing event HEARTBEAT received from 
> http://192.168.33.7:5050/master/api/v1/scheduler
> vagrant@aurora:~$ journalctl -u aurora-scheduler | tail -3
> Mar 05 00:44:28 aurora aurora-scheduler[3609]: I0305 00:44:28.442432  3727 
> scheduler.cpp:676] Enqueuing event HEARTBEAT received from 
> http://192.168.33.7:5050/master/api/v1/scheduler
> Mar 05 00:44:43 aurora aurora-scheduler[3609]: I0305 00:44:43.444105  3726 
> scheduler.cpp:676] Enqueuing event HEARTBEAT received from 
> http://192.168.33.7:5050/master/api/v1/scheduler
> Mar 05 00:44:58 aurora aurora-scheduler[3609]: I0305 00:44:58.444711  3729 
> scheduler.cpp:676] Enqueuing event HEARTBEAT received from 
> http://192.168.33.7:5050/master/api/v1/scheduler
> vagrant@aurora:~$ tail -3 /var/log/aurora/observer.log
> Mar  5 00:36:18 aurora aurora-observer[3129]: Bottle v0.11.6 server starting 
> up (using CherryPyServer())...
> Mar  5 00:36:18 aurora aurora-observer[3129]: Listening on 
> http://192.168.33.7:1338/
> Mar  5 00:36:18 aurora aurora-observer[3129]: Hit Ctrl-C to quit.
> vagrant@aurora:~$ journalctl -u aurora-executor | tail -3
> Mar 05 00:36:18 aurora aurora-observer[3129]: Bottle v0.11.6 server starting 
> up (using CherryPyServer())...
> Mar 05 00:36:18 aurora aurora-observer[3129]: Listening on 
> http://192.168.33.7:1338/
> Mar 05 00:36:18 aurora aurora-observer[3129]: Hit Ctrl-C to quit.
> ```
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>



Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199421
---



@ReviewBot retry

- Stephan Erb


On March 19, 2018, 3:55 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 3:55 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   src/main/python/apache/thermos/runner/thermos_runner.py 
> fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 
> 
> 
> Diff: https://reviews.apache.org/r/66136/diff/1/
> 
> 
> Testing
> ---
> 
> Manually verified that the observer debug logs still contain useful output.
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>



Re: Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/#review199420
---



Master (aaadad7) is red with this patch.
  ./build-support/jenkins/build.sh

at org.easymock.internal.ReplayState.invoke(ReplayState.java:46)
at 
org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:40)
at 
org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
at com.sun.proxy.$Proxy21.changeState(Unknown Source)
at 
org.apache.aurora.scheduler.TaskStatusHandlerImpl.lambda$run$0(TaskStatusHandlerImpl.java:158)
at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
at 
org.apache.aurora.scheduler.storage.testing.StorageTestUtil.lambda$expectWrite$1(StorageTestUtil.java:83)
at org.easymock.internal.Result.answer(Result.java:106)
at org.easymock.internal.ReplayState.invokeInner(ReplayState.java:60)
at org.easymock.internal.ReplayState.invoke(ReplayState.java:46)
at 
org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:40)
at 
org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
at com.sun.proxy.$Proxy20.write(Unknown Source)
at 
org.apache.aurora.scheduler.TaskStatusHandlerImpl.run(TaskStatusHandlerImpl.java:154)
at 
com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
at java.lang.Thread.run(Thread.java:748)

I0319 16:27:21.402 [ShutdownHook, SchedulerMain] Stopping scheduler services. 

1081 tests completed, 1 failed, 1 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: 
file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:jacocoTestCoverageVerification

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 7m 45s
45 actionable tasks: 36 executed, 9 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On March 19, 2018, 2:55 p.m., Stephan Erb wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66136/
> ---
> 
> (Updated March 19, 2018, 2:55 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Renan DelValle.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first part of a small series of Thermos observer performance 
> improvements.
> 
> As a first iteration, this switches all logging to use the logger-embedded 
> formatting
> rather than doing it eager up front. This has the advantage that we produce 
> less garbage
> if debug logging is disabled.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/thermos/common/ckpt.py 
> e79ec6ac42483bfab4ccd94390d0f2af95903a95 
>   src/main/python/apache/thermos/core/helper.py 
> 0811e846ce8e39a22d55732604876c37d2caa239 
>   src/main/python/apache/thermos/core/muxer.py 
> 47e77f7685891545b2526e71e016ffcd454f938c 
>   src/main/python/apache/thermos/core/process.py 
> 4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
>   src/main/python/apache/thermos/core/runner.py 
> 1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
>   src/main/python/apache/thermos/monitoring/disk.py 
> 52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
>   src/main/python/apache/thermos/monitoring/monitor.py 
> d77703e9027e6c23ffb67d936cf8beef0384b4a6 
>   src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
> 3000e95e2930a01f57ac855960b5db7aabbfc0ca 
>   src/main/python/apache/thermos/monitoring/resource.py 
> f5e3849ca6682c6d4720698be869ca6b9f703b94 
>   src/main/python/apache/thermos/observer/http/file_browser.py 
> d0996651312a763ed68a68ae358524cfb4a94d29 
>   src/main/python/apache/thermos/observer/http/http_observer.py 
> 5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
>   src/main/python/apache/thermos/observer/http/static_assets.py 
> 83adeb3e699274483cb72910b0d25f7b30098f24 
>   src/main/python/apache/thermos/observer/observed_task.py 
> 08540e11523b0db990de57fe1ce7517047e2632b 
>   src/main/python/apache/thermos/observer/task_observer.py 
> 4bb5d239e81fe4659397f899760c0e8853e93786 
>   

Review Request 66139: Speedup regular Thermos observer checkpoint refresh

2018-03-19 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66139/
---

Review request for Aurora, Jordan Ly and Renan DelValle.


Repository: aurora


Description
---

Profiling indicated that roughly 70% of the refresh time was spend in 
`os.path.realpath`. 
This was introduced in https://reviews.apache.org/r/35580/ to properly handle 
the `latest`
symlink in the Mesos folder layout. 

This patch takes a slightly different approach to solve this problem based on 
`os.path.islink`.
The latter is faster as it just needs to look at a single folder rather than an 
entire path.

To facilitate this optimization the patch removes unused functionality from 
ExecutorDetector.
Most probably this  was used by former GC executor that got removed a few years 
ago.


Diffs
-

  src/main/python/apache/aurora/executor/BUILD 
486230db34a22ea5dd0f68da911c0afb1afbcac0 
  src/main/python/apache/aurora/executor/common/executor_detector.py 
a07bfc34caa5f86153ace8184b061e253c39e92e 
  src/main/python/apache/aurora/executor/common/path_detector.py 
ed264d74ef5a5a7aa681a56b340f9b16504a88ad 
  src/main/python/apache/thermos/monitoring/detector.py 
6e5a620e9f49e0960a814f4c1f701a21cc7678fd 
  src/test/python/apache/aurora/executor/common/test_executor_detector.py 
d2a948f2e95cbc1723b63462787c4256cc56731d 
  src/test/python/apache/aurora/executor/common/test_path_detector.py 
7b5ef0cf552d22d4cfbf3357071de036551026dc 


Diff: https://reviews.apache.org/r/66139/diff/1/


Testing
---

I have tested this build on a node with 75 running tasks and 4500 finished ones.

Before this patch:

D0319 09:05:07.219057 14514 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 2.26s
D0319 09:05:14.412610 14514 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 2.39s
D0319 09:05:21.600985 14514 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 2.29s
 
With this patch:

D0319 09:00:24.084657 5427 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.77s
D0319 09:00:29.869699 5427 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.78s
D0319 09:00:35.643407 5427 task_observer.py:142] TaskObserver: finished 
checkpoint refresh in 0.77s

The regular 2-3 second freezes when navigating the Thermos UI are now almost 
gone for me.


Thanks,

Stephan Erb



Re: Review Request 66103: Introduce mesos disk collector

2018-03-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/#review199418
---



Master (aaadad7) is red with this patch.
  ./build-support/jenkins/build.sh

at org.easymock.internal.ReplayState.invoke(ReplayState.java:46)
at 
org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:40)
at 
org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
at com.sun.proxy.$Proxy21.changeState(Unknown Source)
at 
org.apache.aurora.scheduler.TaskStatusHandlerImpl.lambda$run$0(TaskStatusHandlerImpl.java:158)
at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:144)
at 
org.apache.aurora.scheduler.storage.Storage$MutateWork$NoResult.apply(Storage.java:139)
at 
org.apache.aurora.scheduler.storage.testing.StorageTestUtil.lambda$expectWrite$1(StorageTestUtil.java:83)
at org.easymock.internal.Result.answer(Result.java:106)
at org.easymock.internal.ReplayState.invokeInner(ReplayState.java:60)
at org.easymock.internal.ReplayState.invoke(ReplayState.java:46)
at 
org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:40)
at 
org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:94)
at com.sun.proxy.$Proxy20.write(Unknown Source)
at 
org.apache.aurora.scheduler.TaskStatusHandlerImpl.run(TaskStatusHandlerImpl.java:154)
at 
com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
at com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
at java.lang.Thread.run(Thread.java:748)

I0319 16:19:42.150 [ShutdownHook, SchedulerMain] Stopping scheduler services. 

1081 tests completed, 1 failed, 1 skipped
:test FAILED
:jacocoTestReport
Coverage report generated: 
file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/jacoco/test/html/index.html
:jacocoTestCoverageVerification

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':test'.
> There were failing tests. See the report at: 
> file:///home/jenkins/jenkins-slave/workspace/AuroraBot/dist/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.

* Get more help at https://help.gradle.org

BUILD FAILED in 7m 40s
45 actionable tasks: 36 executed, 9 up-to-date


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On March 19, 2018, 2:58 p.m., Reza Motamedi wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66103/
> ---
> 
> (Updated March 19, 2018, 2:58 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
> Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> When disk isolation is enabled in a Mesos agent it calculates the disk usage 
> for each container. 
> Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
> essentially repeating the work already done by the agent. In practice, we see 
> that disk monitoring is one of the most expensive resource monitoring tasks. 
> For instance, when there are deeply nested directories, the CPU utilization 
> of the observer process can easily reach 1.5 CPUs. It would be ideal if we 
> delegate the disk monitoring task to the agent and do it only once. With this 
> approach, when disk collection has improved in the agent (for instance by 
> implementing XFS isolation), we can simply benefit from it without any code 
> change. Some more information about the problem is provided in AURORA-1918.
> 
> This patch that introduces `MesosDiskCollector` which queries the agent's API 
> endpoint to lookup disk_used_bytes. Note that there is also resource 
> monitoring in thermos executor. Currently, I left the disk collector there to 
> use the `du` implementation. That can be changed in a later patch.
> 
> I modified some vagrant config files including `aurora-executor.service` and 
> `etc_mesos-slave/isolation` for testing. They can be left as is. I included 
> them in this patch to show how this would work e2e.
> 
> 
> Diffs
> -
> 
>   3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
>   examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
>   examples/vagrant/mesos_config/etc_mesos-slave/isolation 
> 1a7028ffc70116b104ef3ad22b7388f637707a0f 
>   examples/vagrant/systemd/aurora-executor.service 
> 5a1a9082ecd7b1367ec677d760a5c375b6db9076 
>   

Review Request 66103: Introduce mesos disk collector

2018-03-19 Thread Reza Motamedi

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66103/
---

Review request for Aurora, David McLaughlin, Daniel Knightly, Jordan Ly, 
Santhosh Kumar Shanmugham, and Stephan Erb.


Repository: aurora


Description
---

When disk isolation is enabled in a Mesos agent it calculates the disk usage 
for each container. 
Thermos Observer also monitors disk usage using `twitter.common.dirutil`, 
essentially repeating the work already done by the agent. In practice, we see 
that disk monitoring is one of the most expensive resource monitoring tasks. 
For instance, when there are deeply nested directories, the CPU utilization of 
the observer process can easily reach 1.5 CPUs. It would be ideal if we 
delegate the disk monitoring task to the agent and do it only once. With this 
approach, when disk collection has improved in the agent (for instance by 
implementing XFS isolation), we can simply benefit from it without any code 
change. Some more information about the problem is provided in AURORA-1918.

This patch that introduces `MesosDiskCollector` which queries the agent's API 
endpoint to lookup disk_used_bytes. Note that there is also resource monitoring 
in thermos executor. Currently, I left the disk collector there to use the `du` 
implementation. That can be changed in a later patch.

I modified some vagrant config files including `aurora-executor.service` and 
`etc_mesos-slave/isolation` for testing. They can be left as is. I included 
them in this patch to show how this would work e2e.


Diffs
-

  3rdparty/python/requirements.txt 4ac242cfa2c1c19cb7447816ab86e748839d3d11 
  examples/jobs/hello_world.aurora 5401bfebe753b5e53abd08baeac501144ced9b5a 
  examples/vagrant/mesos_config/etc_mesos-slave/isolation 
1a7028ffc70116b104ef3ad22b7388f637707a0f 
  examples/vagrant/systemd/aurora-executor.service 
5a1a9082ecd7b1367ec677d760a5c375b6db9076 
  src/main/python/apache/aurora/tools/thermos_observer.py 
dd9f0c46ceac9e939b1b763073314161de0ea614 
  src/main/python/apache/thermos/monitoring/BUILD 
65ba7088f65e7baa5d30744736ba456b46a55e86 
  src/main/python/apache/thermos/monitoring/disk.py 
52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
  src/main/python/apache/thermos/monitoring/resource.py 
f5e3849ca6682c6d4720698be869ca6b9f703b94 
  src/main/python/apache/thermos/observer/task_observer.py 
4bb5d239e81fe4659397f899760c0e8853e93786 
  
src/test/python/apache/aurora/executor/common/test_resource_manager_integration.py
 fe74bd1d3ecd89fca1b5b2251202cbbc0f24 
  src/test/python/apache/thermos/monitoring/BUILD 
8f2b39336dce6c7b580e6ba0009f60afdcb89179 
  src/test/python/apache/thermos/monitoring/test_disk.py 
362393bfd1facf3198e2d438d0596b16700b72b8 


Diff: https://reviews.apache.org/r/66103/diff/1/


Testing
---

I added unit tests.
Tested in vagrant and it works as intenced.
I also built and deployed in our test enviroment. In order to measure imporoved 
performance I created jobs with nested folders and noticed reduction in CPU 
utilization of the Observer process, by at least 60%. (1.5 CPU cores to 0.4 CPU 
cores)


Thanks,

Reza Motamedi



Review Request 66136: Switch Thermos to lazy log formatting

2018-03-19 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66136/
---

Review request for Aurora, Jordan Ly and Renan DelValle.


Repository: aurora


Description
---

This is the first part of a small series of Thermos observer performance 
improvements.

As a first iteration, this switches all logging to use the logger-embedded 
formatting
rather than doing it eager up front. This has the advantage that we produce 
less garbage
if debug logging is disabled.


Diffs
-

  src/main/python/apache/thermos/common/ckpt.py 
e79ec6ac42483bfab4ccd94390d0f2af95903a95 
  src/main/python/apache/thermos/core/helper.py 
0811e846ce8e39a22d55732604876c37d2caa239 
  src/main/python/apache/thermos/core/muxer.py 
47e77f7685891545b2526e71e016ffcd454f938c 
  src/main/python/apache/thermos/core/process.py 
4a4678ff39c84cb87836aca19365c5b2aabc4fa4 
  src/main/python/apache/thermos/core/runner.py 
1b63c083246f8ec1a43cb1686968c2b1bb4a14b5 
  src/main/python/apache/thermos/monitoring/disk.py 
52c5d74fd70b5942ea3ef5101ba3f27bfc98fc21 
  src/main/python/apache/thermos/monitoring/monitor.py 
d77703e9027e6c23ffb67d936cf8beef0384b4a6 
  src/main/python/apache/thermos/monitoring/process_collector_psutil.py 
3000e95e2930a01f57ac855960b5db7aabbfc0ca 
  src/main/python/apache/thermos/monitoring/resource.py 
f5e3849ca6682c6d4720698be869ca6b9f703b94 
  src/main/python/apache/thermos/observer/http/file_browser.py 
d0996651312a763ed68a68ae358524cfb4a94d29 
  src/main/python/apache/thermos/observer/http/http_observer.py 
5bfc4f28bfbbcb654a059a3579844f5dbe082a9b 
  src/main/python/apache/thermos/observer/http/static_assets.py 
83adeb3e699274483cb72910b0d25f7b30098f24 
  src/main/python/apache/thermos/observer/observed_task.py 
08540e11523b0db990de57fe1ce7517047e2632b 
  src/main/python/apache/thermos/observer/task_observer.py 
4bb5d239e81fe4659397f899760c0e8853e93786 
  src/main/python/apache/thermos/runner/thermos_runner.py 
fa4f0fb7452bf8b85f9f3fa27283f54d7c0fe4f2 


Diff: https://reviews.apache.org/r/66136/diff/1/


Testing
---

Manually verified that the observer debug logs still contain useful output.


Thanks,

Stephan Erb