Re: Review Request 51929: Scheduling multiple tasks per round.

2016-09-15 Thread Maxim Khutornenko


> On Sept. 16, 2016, 1:20 a.m., Aurora ReviewBot wrote:
> > Master (783baae) is red with this patch.
> >   ./build-support/jenkins/build.sh
> > 
> >  # Create file stdout for capturing output. 
> > We can't use StringIO mock
> >  # because TestProcess is running fork.
> >  with open(os.path.join(td, 'sys_stdout'), 
> > 'w+') as stdout:
> >    with open(os.path.join(td, 
> > 'sys_stderr'), 'w+') as stderr:
> >  with mutable_sys():
> >    sys.stdout, sys.stderr = stdout, 
> > stderr
> >  
> >    p = TestProcess('process', 'echo 
> > hello world; echo >&2 hello stderr', 0,
> >    taskpath, sandbox, 
> > logger_destination=LoggerDestination.BOTH)
> >    p.start()
> >    rc = 
> > wait_for_rc(taskpath.getpath('process_checkpoint'))
> >  
> >    assert rc == 0
> >    # Check log files were created in 
> > std path with correct content
> >  > assert_log_content(taskpath, 
> > 'stdout', 'hello world\n')
> >  
> >  
> > src/test/python/apache/thermos/core/test_process.py:487: 
> >  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> >  
> >  taskpath =  > at 0x7fdd3cd73b10>
> >  log_name = 'stdout'
> >  expected_content = 'hello world\n'
> >  
> >  def assert_log_content(taskpath, log_name, 
> > expected_content):
> >    log = 
> > taskpath.with_filename(log_name).getpath('process_logdir')
> >    assert os.path.exists(log)
> >    with open(log, 'r') as fp:
> >  >   assert fp.read() == expected_content
> >  E   assert '' == 'hello world\n'
> >  E + hello world
> >  
> >  
> > src/test/python/apache/thermos/core/test_process.py:313: AssertionError
> >   generated xml file: 
> > /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml
> >  
> >   1 failed, 710 passed, 6 skipped, 1 warnings 
> > in 226.09 seconds 
> >  
> > FAILURE
> > 
> > 
> > 01:19:57 04:18   [complete]
> >FAILURE
> > 
> > 
> > I will refresh this build result if you post a review containing 
> > "@ReviewBot retry"

@ReviewBot retry


- Maxim


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review149162
---


On Sept. 16, 2016, 12:51 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> ---
> 
> (Updated Sept. 16, 2016, 12:51 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is phase 2 of scheduling perf improvement effort started in 
> https://reviews.apache.org/r/51759/.
> 
> We can now take a configurable number of task IDs from a given `TaskGroup` 
> per scheduling round. The idea is to go deeper through the offer queue and 
> assign more than one task if possible. This approach delivers substantially 
> better MTTA and still ensures fairness across multiple `TaskGroups`. We have 
> observed an almost linear improvement in MTTA (4x+ with 5 tasks per round), 
> which suggests that `max_tasks_per_schedule_attempt` can be set even higher 
> if the majority of cluster jobs have a large number of instances and/or 
> large update batch sizes.
> 
> As far as a single round perf goes, we can consider the following 2 
> worst-case scenarios:
> - master: single task scheduling fails after trying all offers in the queue
> - this patch: N tasks launched with the very last N offers in the queue + `(N 
> x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time 
> as 1 task against M offers (as they all share the same `TaskGroup`), the only 
> measurable difference comes from the additional `N x 
> 

Re: Review Request 51929: Scheduling multiple tasks per round.

2016-09-15 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review149162
---



Master (783baae) is red with this patch.
  ./build-support/jenkins/build.sh

 # Create file stdout for capturing output. We 
can't use StringIO mock
 # because TestProcess is running fork.
 with open(os.path.join(td, 'sys_stdout'), 
'w+') as stdout:
   with open(os.path.join(td, 'sys_stderr'), 
'w+') as stderr:
 with mutable_sys():
   sys.stdout, sys.stderr = stdout, 
stderr
 
   p = TestProcess('process', 'echo hello 
world; echo >&2 hello stderr', 0,
   taskpath, sandbox, 
logger_destination=LoggerDestination.BOTH)
   p.start()
   rc = 
wait_for_rc(taskpath.getpath('process_checkpoint'))
 
   assert rc == 0
   # Check log files were created in std 
path with correct content
 > assert_log_content(taskpath, 'stdout', 
'hello world\n')
 
 src/test/python/apache/thermos/core/test_process.py:487: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 
 taskpath = 
 log_name = 'stdout'
 expected_content = 'hello world\n'
 
 def assert_log_content(taskpath, log_name, 
expected_content):
   log = 
taskpath.with_filename(log_name).getpath('process_logdir')
   assert os.path.exists(log)
   with open(log, 'r') as fp:
 >   assert fp.read() == expected_content
 E   assert '' == 'hello world\n'
 E + hello world
 
 src/test/python/apache/thermos/core/test_process.py:313: 
AssertionError
  generated xml file: 
/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml
 
  1 failed, 710 passed, 6 skipped, 1 warnings in 
226.09 seconds 
 
FAILURE


01:19:57 04:18   [complete]
   FAILURE


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 16, 2016, 12:51 a.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> ---
> 
> (Updated Sept. 16, 2016, 12:51 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is phase 2 of scheduling perf improvement effort started in 
> https://reviews.apache.org/r/51759/.
> 
> We can now take a configurable number of task IDs from a given `TaskGroup` 
> per scheduling round. The idea is to go deeper through the offer queue and 
> assign more than one task if possible. This approach delivers substantially 
> better MTTA and still ensures fairness across multiple `TaskGroups`. We have 
> observed an almost linear improvement in MTTA (4x+ with 5 tasks per round), 
> which suggests that `max_tasks_per_schedule_attempt` can be set even higher 
> if the majority of cluster jobs have a large number of instances and/or 
> large update batch sizes.
> 
> As far as a single round perf goes, we can consider the following 2 
> worst-case scenarios:
> - master: single task scheduling fails after trying all offers in the queue
> - this patch: N tasks launched with the very last N offers in the queue + `(N 
> x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time 
> as 1 task against M offers (as they all share the same `TaskGroup`), the only 
> measurable difference comes from the additional `N x 
> single_task_launch_latency` overhead. Based on real cluster observations, the 
> `single_task_launch_latency` is less than 1% of a single task scheduling 
> attempt, which is far smaller than the savings from avoiding additional 
> scheduling rounds.
> 
> As far as jmh results go, the new approach (batching + multiple tasks per 
> round) is only 

Review Request 51929: Scheduling multiple tasks per round.

2016-09-15 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/
---

Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Repository: aurora


Description
---

This is phase 2 of scheduling perf improvement effort started in 
https://reviews.apache.org/r/51759/.

We can now take a configurable number of task IDs from a given `TaskGroup` 
per scheduling round. The idea is to go deeper through the offer queue and 
assign more than one task if possible. This approach delivers substantially 
better MTTA and still ensures fairness across multiple `TaskGroups`. We have 
observed an almost linear improvement in MTTA (4x+ with 5 tasks per round), 
which suggests that `max_tasks_per_schedule_attempt` can be set even higher 
if the majority of cluster jobs have a large number of instances and/or 
large update batch sizes.
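
For illustration only, the idea can be sketched in a few lines of Python. This 
is not the code from the patch: `Offer`, `schedule_round`, the single scalar 
`cpus` resource, and the one-task-per-offer rule are all assumptions made for 
the example. Only the `max_tasks_per_schedule_attempt` name comes from the 
description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical stand-in for a resource offer with one scalar resource."""
    host: str
    cpus: float


def schedule_round(task_ids, task_cpus, offers, max_tasks_per_schedule_attempt):
    """Assign up to max_tasks_per_schedule_attempt tasks in one pass over offers.

    task_ids is a deque of pending task IDs from one task group, so every task
    has the same requirement (task_cpus). Assumption: each offer hosts at most
    one task per round. Returns a dict mapping task ID to the chosen host.
    """
    batch_size = min(max_tasks_per_schedule_attempt, len(task_ids))
    remaining = deque(task_ids.popleft() for _ in range(batch_size))
    assigned = {}
    for offer in offers:
        if not remaining:
            break  # the whole batch was placed; stop scanning the queue
        if offer.cpus >= task_cpus:
            assigned[remaining.popleft()] = offer.host
            offer.cpus -= task_cpus
    # Tasks that found no offer return to the front of the group, in order.
    task_ids.extendleft(reversed(remaining))
    return assigned


pending = deque(["t1", "t2", "t3", "t4"])
queue = [Offer("h1", 0.5), Offer("h2", 2.0), Offer("h3", 1.0)]
print(schedule_round(pending, 1.0, queue, max_tasks_per_schedule_attempt=3))
```

With a limit of 1 this degenerates to one task per round; a larger limit lets 
a single scan of the offer queue place several tasks from the same group.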

As far as a single round perf goes, we can consider the following 2 worst-case 
scenarios:
- master: single task scheduling fails after trying all offers in the queue
- this patch: N tasks launched with the very last N offers in the queue + `(N x 
single_task_launch_latency)`

Assuming that matching N tasks against M offers takes exactly the same time as 
1 task against M offers (as they all share the same `TaskGroup`), the only 
measurable difference comes from the additional `N x 
single_task_launch_latency` overhead. Based on real cluster observations, the 
`single_task_launch_latency` is less than 1% of a single task scheduling 
attempt, which is far smaller than the savings from avoiding additional 
scheduling rounds.
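
A rough back-of-the-envelope model of this argument, as a sketch under stated 
assumptions: every round pays one full worst-case scan of the offer queue, and 
`single_task_launch_latency` is taken at its quoted upper bound of 1% of a 
scheduling attempt. The function names and the cost unit (one scheduling 
attempt = 1.0) are hypothetical.

```python
def worst_case_round_cost(n_tasks, attempt_cost=1.0, launch_fraction=0.01):
    """Cost of one worst-case round: a full offer scan plus n_tasks launches.

    launch_fraction is single_task_launch_latency expressed as a fraction of
    one scheduling attempt; 0.01 is the upper bound quoted above.
    """
    return attempt_cost + n_tasks * launch_fraction * attempt_cost


def cost_to_launch(total_tasks, tasks_per_round):
    """Total worst-case cost to launch total_tasks, tasks_per_round at a time."""
    full_rounds, leftover = divmod(total_tasks, tasks_per_round)
    cost = full_rounds * worst_case_round_cost(tasks_per_round)
    if leftover:
        cost += worst_case_round_cost(leftover)
    return cost


# Compare launching 5 tasks one per round against all 5 in a single round.
one_per_round = cost_to_launch(5, 1)
five_per_round = cost_to_launch(5, 5)
print(f"one per round: {one_per_round:.2f}, five per round: {five_per_round:.2f}")
```

In this model the extra launch latency adds a few percent to a single round, 
while each avoided round saves a whole scheduling attempt.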

As far as jmh results go, the new approach (batching + multiple tasks per 
round) is only slightly more demanding (~8%). Both results though are MUCH 
higher than the real cluster perf, which just confirms we are not bound by CPU 
time here:

Master:
```
Benchmark                                                                   Mode   Cnt      Score     Error  Units
SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  thrpt   10  17126.183 ± 488.425  ops/s
```

This patch:
```
Benchmark                                                                   Mode   Cnt      Score     Error  Units
SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  thrpt   10  15838.051 ± 187.890  ops/s
```
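
As a quick check of the "~8%" figure, the two scores above can be compared 
directly. The helper name below is hypothetical; the inputs are the jmh 
`Score` values as reported (error margins ignored).

```python
def relative_overhead(master_ops, patch_ops):
    """Return (throughput drop vs. master, extra time per operation)."""
    throughput_drop = (master_ops - patch_ops) / master_ops
    extra_time_per_op = master_ops / patch_ops - 1
    return throughput_drop, extra_time_per_op


drop, extra = relative_overhead(17126.183, 15838.051)
print(f"throughput drop: {drop:.1%}, extra time per op: {extra:.1%}")
```

Depending on which run is taken as the baseline, the difference comes out at 
roughly 7.5% (lower throughput) or 8.1% (more time per operation).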

NOTE: this will not apply cleanly as it branched off of 
https://reviews.apache.org/r/51765, which itself depends on 
https://reviews.apache.org/r/51759/.


Diffs
-

  src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 9d0d40b82653fb923bed16d06546288a1576c21d
  src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java 87b9e1928ab2d44668df1123f32ffdc4197c0c70
  src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 11e8033438ad0808e446e41bb26b3fa4c04136c7
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroup.java 5d319557057e27fd5fc6d3e553e9ca9139399c50
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java c044ebe6f72183a67462bbd8e5be983eb592c3e9
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java d266f6a25ae2360db2977c43768a19b1f1efe8ff
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 7f7b4358ef05c0f0d0e14daac1a5c25488467dc9
  src/test/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilterTest.java ece476b918e6f2c128039e561eea23a94d8ed396
  src/test/java/org/apache/aurora/scheduler/filter/AttributeAggregateTest.java 209f9298a1d55207b9b41159f2ab366f92c1eb70
  src/test/java/org/apache/aurora/scheduler/filter/SchedulingFilterImplTest.java 0cf23df9f373c0d9b27e55a12adefd5f5fd81ba5
  src/test/java/org/apache/aurora/scheduler/http/AbstractJettyTest.java c2ceb4e7685a9301f8014a9183e02fbad65bca26
  src/test/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilterTest.java ee5c6528af89cc62a35fdb314358c489556d8131
  src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorImplTest.java 98048fabc00f233925b6cca015c2525980556e2b
  src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorModuleTest.java 2c3e5f32c774be07a5fa28c8bcf3b9a5d88059a1
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskGroupsTest.java 95cf25eda0a5bfc0cc4c46d1439ebe9d5359ce79
  src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java 72562e6bd9a9860c834e6a9faa094c28600a8fed
  src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java b4d27f69ad5d4cce03da9f04424dc35d30e8af29

Diff: https://reviews.apache.org/r/51929/diff/


Testing
---

All types of testing including deploying to test and production clusters.


Thanks,

Maxim Khutornenko



Re: Review Request 51874: Change framework_name default value from 'TwitterScheduler' to 'Aurora'

2016-09-15 Thread Stephan Erb


> On Sept. 15, 2016, 12:48 a.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java,
> >  line 82
> > 
> >
> > Did you try to roll back to a pre-0.15 scheduler while changing the 
> > framework name? Trying to see if we can drop this 'backwards incompatible' 
> > statement now.
> 
> Santhosh Kumar Shanmugham wrote:
> Tested "roll-forward" (to Aurora) and "roll-back" (via release and config 
> change) (to TwitterScheduler) on Aurora-0.14 (depends on Mesos-0.27.2) and 
> Aurora-0.15 (depends on Mesos-0.28.2). The master was able to re-register the 
> framework with the same "id" and the running tasks continued to make 
> progress. (See details in the testing section.)
> 
> However, I could not roll back the scheduler from 0.15 to 0.14 from source 
> inside vagrant. "aurorabuild all" started to complain with the message
> "Could not satisfy all requirements for mesos.native==0.27.2"
> 
> Santhosh Kumar Shanmugham wrote:
> Tested changing the framework_name on Aurora 0.14, 0.15 and master. 
> Dropping the comment about 'backward incompatible'.
> 
> Zameer Manji wrote:
> Just to be clear, you tested this change against a single Mesos master 
> version, right? Could you share which version of Mesos that was?
> 
> Santhosh Kumar Shanmugham wrote:
> I made 2 sets of tests, one in vagrant and another against a test 
> cluster. Below are the master versions for the different envs.
> 
> Inside the Vagrant box, Mesos master's version changed based on the 
> release (Vagrantfile changes).
> - 0.14 => 0.27.x
> - 0.15 => 0.28.x
> - latest => 1.0.x
> 
> In the test cluster the Mesos master version was at 1.0.0. (Attempting to 
> run scheduler against Mesos-0.28 failed due to inconsistency in the Mesos jar 
> version.)

I have done the name change on all our clusters. We are still on 0.28, so all 
good.


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/#review148988
---


On Sept. 15, 2016, 9:02 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51874/
> ---
> 
> (Updated Sept. 15, 2016, 9:02 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1688
> https://issues.apache.org/jira/browse/AURORA-1688
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Change framework_name default value from 'TwitterScheduler' to 'Aurora'
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   
> src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
>  8a386bd208956eb0c8c2f48874b0c6fb3af58872 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> 97677f24a50963178a123b420d7ac136e4fde3fe 
> 
> Diff: https://reviews.apache.org/r/51874/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Testing to ensure backward compatibility:
> 
> # HEAD of master:
> 
> # Case 1: Rolling forward does not impact running tasks:
> Renaming framework from 'TwitterScheduler' to 'Aurora':
> 
> The framework re-registers after restart (treated by master as failover) and 
> gets the same framework-id. Running tasks remain unaffected.
> 
> Master log:
> I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks to 
> failover
> I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with 
> fd 28: Transport endpoint is not connected
> I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'Aurora' at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora 
> with checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, 
> GPU_RESOURCES ]
> I0914 16:48:43.75  9813 master.cpp:2564] Updating info for framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722256  9813 master.cpp:2577] Framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
> I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722595  9813 

Re: Review Request 51874: Change framework_name default value from 'TwitterScheduler' to 'Aurora'

2016-09-15 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/#review149115
---


Ship it!




Master (783baae) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 15, 2016, 7:02 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51874/
> ---
> 
> (Updated Sept. 15, 2016, 7:02 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1688
> https://issues.apache.org/jira/browse/AURORA-1688
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Change framework_name default value from 'TwitterScheduler' to 'Aurora'
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   
> src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
>  8a386bd208956eb0c8c2f48874b0c6fb3af58872 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> 97677f24a50963178a123b420d7ac136e4fde3fe 
> 
> Diff: https://reviews.apache.org/r/51874/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Testing to ensure backward compatibility:
> 
> # HEAD of master:
> 
> # Case 1: Rolling forward does not impact running tasks:
> Renaming framework from 'TwitterScheduler' to 'Aurora':
> 
> The framework re-registers after restart (treated by master as failover) and 
> gets the same framework-id. Running tasks remain unaffected.
> 
> Master log:
> I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks to 
> failover
> I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with 
> fd 28: Transport endpoint is not connected
> I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'Aurora' at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora 
> with checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, 
> GPU_RESOURCES ]
> I0914 16:48:43.75  9813 master.cpp:2564] Updating info for framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722256  9813 master.cpp:2577] Framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
> I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722595  9813 master.cpp:5709] Sending 1 offers to framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> 
> Scheduler log:
> I0914 16:48:44.157 [Thread-10, MesosSchedulerImpl:151] Registered with ID 
> value: "071c44a1-b4d4-4339-a727-03a79f725851-"
> , master: id: "461b98b8-63e1-40e3-96fd-cb62420945ae"
> ip: 119646400
> port: 5050
> pid: "master@192.168.33.7:5050"
> hostname: "aurora.local"
> version: "1.0.0"
> address {
>   hostname: "aurora.local"
>   ip: "192.168.33.7"
>   port: 5050
> }
> 
> # Case 2: Rolling backward does not impact running tasks:
> Rolling back framework name from 'Aurora' to 'TwitterScheduler':
> 
> The framework re-registers after restart (treated by master as failover) and 
> gets the same framework-id. Running tasks remain unaffected.
> 
> Master log:
> I0914 16:51:33.203495  9812 master.cpp:1297] Giving framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 3weeks to 
> failover
> I0914 16:51:33.203526  9812 hierarchical.cpp:382] Deactivated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:51:49.614074  9813 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'TwitterScheduler' at 
> scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083
> I0914 16:51:49.614215  9813 master.cpp:2500] Subscribing framework 
> TwitterScheduler with checkpointing enabled and capabilities [ 
> REVOCABLE_RESOURCES, GPU_RESOURCES ]
> I0914 16:51:49.614312  9813 master.cpp:2564] Updating info for framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:51:49.614359  9813 master.cpp:2577] Framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
> 

Re: Review Request 51874: Change framework_name default value from 'TwitterScheduler' to 'Aurora'

2016-09-15 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/
---

(Updated Sept. 15, 2016, 12:02 p.m.)


Review request for Aurora, Joshua Cohen and Maxim Khutornenko.


Changes
---

Dropping comment about "backward incompatibility" when using framework_name.


Bugs: AURORA-1688
https://issues.apache.org/jira/browse/AURORA-1688


Repository: aurora


Description
---

Change framework_name default value from 'TwitterScheduler' to 'Aurora'


Diffs (updated)
-

  RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
  
src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
 8a386bd208956eb0c8c2f48874b0c6fb3af58872 
  src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
97677f24a50963178a123b420d7ac136e4fde3fe 

Diff: https://reviews.apache.org/r/51874/diff/


Testing
---

./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh

Testing to ensure backward compatibility:

# HEAD of master:

# Case 1: Rolling forward does not impact running tasks:
Renaming framework from 'TwitterScheduler' to 'Aurora':

The framework re-registers after restart (treated by master as failover) and 
gets the same framework-id. Running tasks remain unaffected.

Master log:
I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 
071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks to 
failover
I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 
071c44a1-b4d4-4339-a727-03a79f725851-
E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with fd 
28: Transport endpoint is not connected
I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for 
framework 'Aurora' at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora with 
checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, GPU_RESOURCES ]
I0914 16:48:43.75  9813 master.cpp:2564] Updating info for framework 
071c44a1-b4d4-4339-a727-03a79f725851-
I0914 16:48:43.722256  9813 master.cpp:2577] Framework 
071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 
071c44a1-b4d4-4339-a727-03a79f725851-
I0914 16:48:43.722595  9813 master.cpp:5709] Sending 1 offers to framework 
071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083

Scheduler log:
I0914 16:48:44.157 [Thread-10, MesosSchedulerImpl:151] Registered with ID 
value: "071c44a1-b4d4-4339-a727-03a79f725851-"
, master: id: "461b98b8-63e1-40e3-96fd-cb62420945ae"
ip: 119646400
port: 5050
pid: "master@192.168.33.7:5050"
hostname: "aurora.local"
version: "1.0.0"
address {
  hostname: "aurora.local"
  ip: "192.168.33.7"
  port: 5050
}

# Case 2: Rolling backward does not impact running tasks:
Rolling back framework name from 'Aurora' to 'TwitterScheduler':

The framework re-registers after restart (treated by master as failover) and 
gets the same framework-id. Running tasks remain unaffected.

Master log:
I0914 16:51:33.203495  9812 master.cpp:1297] Giving framework 
071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 3weeks to 
failover
I0914 16:51:33.203526  9812 hierarchical.cpp:382] Deactivated framework 
071c44a1-b4d4-4339-a727-03a79f725851-
I0914 16:51:49.614074  9813 master.cpp:2424] Received SUBSCRIBE call for 
framework 'TwitterScheduler' at 
scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083
I0914 16:51:49.614215  9813 master.cpp:2500] Subscribing framework 
TwitterScheduler with checkpointing enabled and capabilities [ 
REVOCABLE_RESOURCES, GPU_RESOURCES ]
I0914 16:51:49.614312  9813 master.cpp:2564] Updating info for framework 
071c44a1-b4d4-4339-a727-03a79f725851-
I0914 16:51:49.614359  9813 master.cpp:2577] Framework 
071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083 failed over
I0914 16:51:49.614977  9813 hierarchical.cpp:348] Activated framework 
071c44a1-b4d4-4339-a727-03a79f725851-
I0914 16:51:49.615170  9813 master.cpp:5709] Sending 1 offers to framework 
071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
scheduler-6fa8b819-aed9-42e1-9c6c-3e4be2f62500@192.168.33.7:8083

Scheduler log:
I0914 16:51:50.249 [Thread-10, MesosSchedulerImpl:151] Registered with ID 
value: "071c44a1-b4d4-4339-a727-03a79f725851-"
, master: id: "461b98b8-63e1-40e3-96fd-cb62420945ae"
ip: 119646400
port: 5050
pid: "master@192.168.33.7:5050"
hostname: 

Re: Review Request 51874: Change framework_name default value from 'TwitterScheduler' to 'Aurora'

2016-09-15 Thread Santhosh Kumar Shanmugham


> On Sept. 14, 2016, 3:48 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java,
> >  line 82
> > 
> >
> > Did you try to roll back to a pre-0.15 scheduler while changing the 
> > framework name? Trying to see if we can drop this 'backwards incompatible' 
> > statement now.
> 
> Santhosh Kumar Shanmugham wrote:
> Tested "roll-forward" (to Aurora) and "roll-back" (via release and config 
> change) (to TwitterScheduler) on Aurora-0.14 (depends on Mesos-0.27.2) and 
> Aurora-0.15 (depends on Mesos-0.28.2). The master was able to re-register the 
> framework with the same "id" and the running tasks continued to make 
> progress. (See details in the testing section.)
> 
> However, I could not roll back the scheduler from 0.15 to 0.14 from source 
> inside vagrant. "aurorabuild all" started to complain with the message
> "Could not satisfy all requirements for mesos.native==0.27.2"
> 
> Santhosh Kumar Shanmugham wrote:
> Tested changing the framework_name on Aurora 0.14, 0.15 and master. 
> Dropping the comment about 'backward incompatible'.
> 
> Zameer Manji wrote:
> Just to be clear, you tested this change against a single Mesos master 
> version, right? Could you share which version of Mesos that was?

I made 2 sets of tests, one in vagrant and another against a test cluster. 
Below are the master versions for the different envs.

Inside the Vagrant box, Mesos master's version changed based on the release 
(Vagrantfile changes).
- 0.14 => 0.27.x
- 0.15 => 0.28.x
- latest => 1.0.x

In the test cluster the Mesos master version was at 1.0.0. (Attempting to run 
scheduler against Mesos-0.28 failed due to inconsistency in the Mesos jar 
version.)


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/#review148988
---


On Sept. 14, 2016, 5:33 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51874/
> ---
> 
> (Updated Sept. 14, 2016, 5:33 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1688
> https://issues.apache.org/jira/browse/AURORA-1688
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Change framework_name default value from 'TwitterScheduler' to 'Aurora'
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   
> src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
>  8a386bd208956eb0c8c2f48874b0c6fb3af58872 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> 97677f24a50963178a123b420d7ac136e4fde3fe 
> 
> Diff: https://reviews.apache.org/r/51874/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Testing to ensure backward compatibility:
> 
> # HEAD of master:
> 
> # Case 1: Rolling forward does not impact running tasks:
> Renaming framework from 'TwitterScheduler' to 'Aurora':
> 
> The framework re-registers after restart (treated by master as failover) and 
> gets the same framework-id. Running tasks remain unaffected.
> 
> Master log:
> I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks to 
> failover
> I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with 
> fd 28: Transport endpoint is not connected
> I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'Aurora' at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora 
> with checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, 
> GPU_RESOURCES ]
> I0914 16:48:43.75  9813 master.cpp:2564] Updating info for framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722256  9813 master.cpp:2577] Framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
> I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722595  9813 master.cpp:5709] Sending 1 offers to framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> 
> 

Re: Review Request 51924: Remove --release-threshold option from aurora job restart.

2016-09-15 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51924/#review149107
---



Master (5069f93) is red with this patch.
  ./build-support/jenkins/build.sh

 # Create file stdout for capturing output. We 
can't use StringIO mock
 # because TestProcess is running fork.
 with open(os.path.join(td, 'sys_stdout'), 
'w+') as stdout:
   with open(os.path.join(td, 'sys_stderr'), 
'w+') as stderr:
 with mutable_sys():
   sys.stdout, sys.stderr = stdout, 
stderr
 
   p = TestProcess('process', 'echo hello 
world; echo >&2 hello stderr', 0,
   taskpath, sandbox, 
logger_destination=LoggerDestination.BOTH)
   p.start()
   rc = 
wait_for_rc(taskpath.getpath('process_checkpoint'))
 
   assert rc == 0
   # Check log files were created in std 
path with correct content
 > assert_log_content(taskpath, 'stdout', 
'hello world\n')
 
 src/test/python/apache/thermos/core/test_process.py:487: 
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 
 taskpath = 
 log_name = 'stdout'
 expected_content = 'hello world\n'
 
 def assert_log_content(taskpath, log_name, 
expected_content):
   log = 
taskpath.with_filename(log_name).getpath('process_logdir')
   assert os.path.exists(log)
   with open(log, 'r') as fp:
 >   assert fp.read() == expected_content
 E   assert '' == 'hello world\n'
 E + hello world
 
 src/test/python/apache/thermos/core/test_process.py:313: 
AssertionError
  generated xml file: 
/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml
 
  1 failed, 708 passed, 6 skipped, 1 warnings in 
342.57 seconds 
 
FAILURE


18:40:10 06:35   [complete]
   FAILURE


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 15, 2016, 6:13 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51924/
> ---
> 
> (Updated Sept. 15, 2016, 6:13 p.m.)
> 
> 
> Review request for Aurora and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1681
> https://issues.apache.org/jira/browse/AURORA-1681
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Remove --release-threshold option from aurora job restart.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   src/main/python/apache/aurora/client/cli/jobs.py 
> 7b4c2692334acfddb53a52a602a5f07e94b4bd86 
> 
> Diff: https://reviews.apache.org/r/51924/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Re: Review Request 51874: Change framework_name default value from 'TwitterScheduler' to 'Aurora'

2016-09-15 Thread Zameer Manji


> On Sept. 14, 2016, 3:48 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java,
> >  line 82
> > 
> >
> > Did you try to rollback to pre 0.15 scheduler while changing the 
> > framework name? Trying to see if we can drop this 'backwards incompatible' 
> > statement now.
> 
> Santhosh Kumar Shanmugham wrote:
> Tested "roll-forward" (to Aurora) and "roll-back" (via release and config 
> change) (to TwitterScheduler) on Aurora-0.14 (depends on Mesos-0.27.2) and 
> Aurora-0.15 (depends on Mesos-0.28.2). The master was able to re-register the 
> framework with the same "id" and the running tasks continued to make 
> progress. (See details in the testing section.)
> 
> However, I could not roll back the scheduler from 0.15 to 0.14 from source 
> inside vagrant. "aurorabuild all" started to complain with the message,
> "Could not satisfy all requirements for mesos.native==0.27.2"
> 
> Santhosh Kumar Shanmugham wrote:
> Tested changing the framework_name on Aurora 0.14, 0.15 and master. 
> Dropping the comment about 'backward incompatible'.

Just to be clear, you tested this change against a single Mesos master version, 
right? Could you share which version of Mesos that was?


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/#review148988
---


On Sept. 14, 2016, 5:33 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51874/
> ---
> 
> (Updated Sept. 14, 2016, 5:33 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1688
> https://issues.apache.org/jira/browse/AURORA-1688
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Change framework_name default value from 'TwitterScheduler' to 'Aurora'
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   
> src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
>  8a386bd208956eb0c8c2f48874b0c6fb3af58872 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> 97677f24a50963178a123b420d7ac136e4fde3fe 
> 
> Diff: https://reviews.apache.org/r/51874/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Testing to ensure backward compatibility:
> 
> # HEAD of master:
> 
> # Case 1: Rolling forward does not impact running tasks:
> Renaming framework from 'TwitterScheduler' to 'Aurora':
> 
> The framework re-registers after restart (treated by master as failover) and 
> gets the same framework-id. Running tasks remain unaffected.
> 
> Master log:
> I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks to 
> failover
> I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with 
> fd 28: Transport endpoint is not connected
> I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'Aurora' at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora 
> with checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, 
> GPU_RESOURCES ]
> I0914 16:48:43.75  9813 master.cpp:2564] Updating info for framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722256  9813 master.cpp:2577] Framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
> I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722595  9813 master.cpp:5709] Sending 1 offers to framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> 
> Scheduler log:
> I0914 16:48:44.157 [Thread-10, MesosSchedulerImpl:151] Registered with ID 
> value: "071c44a1-b4d4-4339-a727-03a79f725851-"
> , master: id: "461b98b8-63e1-40e3-96fd-cb62420945ae"
> ip: 119646400
> port: 5050
> pid: "master@192.168.33.7:5050"
> hostname: "aurora.local"
> version: "1.0.0"
> address {
>   hostname: "aurora.local"
>   ip: "192.168.33.7"
>   port: 5050
> }
> 
> # Case 2: Rolling backward does not impact running tasks:
> Rolling back framework name from 

Re: Review Request 51874: Change framework_name default value from 'TwitterScheduler' to 'Aurora'

2016-09-15 Thread Santhosh Kumar Shanmugham


> On Sept. 14, 2016, 3:48 p.m., Maxim Khutornenko wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java,
> >  line 82
> > 
> >
> > Did you try to rollback to pre 0.15 scheduler while changing the 
> > framework name? Trying to see if we can drop this 'backwards incompatible' 
> > statement now.
> 
> Santhosh Kumar Shanmugham wrote:
> Tested "roll-forward" (to Aurora) and "roll-back" (via release and config 
> change) (to TwitterScheduler) on Aurora-0.14 (depends on Mesos-0.27.2) and 
> Aurora-0.15 (depends on Mesos-0.28.2). The master was able to re-register the 
> framework with the same "id" and the running tasks continued to make 
> progress. (See details in the testing section.)
> 
> However, I could not roll back the scheduler from 0.15 to 0.14 from source 
> inside vagrant. "aurorabuild all" started to complain with the message,
> "Could not satisfy all requirements for mesos.native==0.27.2"

Tested changing the framework_name on Aurora 0.14, 0.15 and master. Dropping 
the comment about 'backward incompatible'.


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51874/#review148988
---


On Sept. 14, 2016, 5:33 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51874/
> ---
> 
> (Updated Sept. 14, 2016, 5:33 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1688
> https://issues.apache.org/jira/browse/AURORA-1688
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Change framework_name default value from 'TwitterScheduler' to 'Aurora'
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   
> src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
>  8a386bd208956eb0c8c2f48874b0c6fb3af58872 
>   src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> 97677f24a50963178a123b420d7ac136e4fde3fe 
> 
> Diff: https://reviews.apache.org/r/51874/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> Testing to ensure backward compatibility:
> 
> # HEAD of master:
> 
> # Case 1: Rolling forward does not impact running tasks:
> Renaming framework from 'TwitterScheduler' to 'Aurora':
> 
> The framework re-registers after restart (treated by master as failover) and 
> gets the same framework-id. Running tasks remain unaffected.
> 
> Master log:
> I0914 16:48:28.408182  9815 master.cpp:1297] Giving framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (TwitterScheduler) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 3weeks to 
> failover
> I0914 16:48:28.408226  9815 hierarchical.cpp:382] Deactivated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> E0914 16:48:28.408617  9819 process.cpp:2105] Failed to shutdown socket with 
> fd 28: Transport endpoint is not connected
> I0914 16:48:43.722126  9813 master.cpp:2424] Received SUBSCRIBE call for 
> framework 'Aurora' at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> I0914 16:48:43.722190  9813 master.cpp:2500] Subscribing framework Aurora 
> with checkpointing enabled and capabilities [ REVOCABLE_RESOURCES, 
> GPU_RESOURCES ]
> I0914 16:48:43.75  9813 master.cpp:2564] Updating info for framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722256  9813 master.cpp:2577] Framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-75517c8f-5913-49e9-8cc4-342a78c9bbcb@192.168.33.7:8083 failed over
> I0914 16:48:43.722429  9813 hierarchical.cpp:348] Activated framework 
> 071c44a1-b4d4-4339-a727-03a79f725851-
> I0914 16:48:43.722595  9813 master.cpp:5709] Sending 1 offers to framework 
> 071c44a1-b4d4-4339-a727-03a79f725851- (Aurora) at 
> scheduler-dfad8309-de4b-47d8-a8f8-82828ea40a12@192.168.33.7:8083
> 
> Scheduler log:
> I0914 16:48:44.157 [Thread-10, MesosSchedulerImpl:151] Registered with ID 
> value: "071c44a1-b4d4-4339-a727-03a79f725851-"
> , master: id: "461b98b8-63e1-40e3-96fd-cb62420945ae"
> ip: 119646400
> port: 5050
> pid: "master@192.168.33.7:5050"
> hostname: "aurora.local"
> version: "1.0.0"
> address {
>   hostname: "aurora.local"
>   ip: "192.168.33.7"
>   port: 5050
> }
> 
> # Case 2: Rolling backward does not impact running tasks:
> Rolling back framework name from 'Aurora' to 'TwitterScheduler':
> 
> The framework re-registers after restart (treated by master as failover) and 
> gets the same framework-id. Running tasks remain unaffected.

Re: Review Request 51763: Batching writes - Part 2 (of 3): Converting cron jobs to use BatchWorker.

2016-09-15 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51763/#review149093
---


Fix it, then Ship it!




LGTM, the changes are fairly straightforward. I dislike the (slight) 
complexity that comes from using the quartz `JobDataMap` but I'm unsure how 
else we can store the data required in a safe and concurrent manner.


config/findbugs/excludeFilter.xml (line 123)


If I'm reading the code correctly we need this because we have a 
`CompletableFuture` and to give it a value we use `null`.

Have you considered using `new Object()` instead so we don't have to add 
this exception to our rules?



src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java (line 
76)


Just to be clear, this annotation is needed to ensure the data in 
`JobDataMap` is persisted properly?



src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java (line 100)


For integration tests I was under the impression we used 
`org.apache.aurora.scheduler.testing.FakeStatsProvider`


- Zameer Manji


On Sept. 14, 2016, 4:12 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51763/
> ---
> 
> (Updated Sept. 14, 2016, 4:12 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the second part of the `BatchWorker` conversion work that moves cron 
> jobs to use non-blocking kill followups and reduces the number of trigger 
> threads. See https://reviews.apache.org/r/51759 for more background on the 
> `BatchWorker`.
> 
> #Problem
> The current implementation of the cron scheduling relies on a large number of 
> threads (`cron_scheduler_num_threads`=100) to support cron triggering and 
> killing existing tasks according to `KILL_EXISTING` collision policy. This 
> creates large spikes of activity at synchronized intervals as users tend to 
> schedule their cron runs around similar schedules. Moreover, the current 
> implementation re-acquires write locks multiple times to deliver on 
> `KILL_EXISTING` policy. 
> 
> #Remediation
> Trigger level batching is still done in a blocking way but multiple cron 
> triggers may be bundled together to share the same write transaction. Any 
> followups, however, are performed in a non-blocking way by relying on a 
> `BatchWorker.executeWithReplay()` and the `BatchWorkCompleted` notification. 
> In order to still ensure non-concurrent execution of a given job key trigger, 
> a token (job key) is saved within the trigger itself. A concurrent trigger 
> will bail if a kill followup is still in progress (token is set AND no entry 
> in `killFollowups` set exists yet).
> 
> #Results
> The above approach allowed reducing the number of cron threads to 10 and 
> likely can be reduced even further. See https://reviews.apache.org/r/51759 
> for the lock contention results.
> 
> 
> Diffs
> -
> 
>   commons/src/main/java/org/apache/aurora/common/util/BackoffHelper.java 
> 8e73dd9ebc43e06f696bbdac4d658e4b225e7df7 
>   commons/src/test/java/org/apache/aurora/common/util/BackoffHelperTest.java 
> bc30990d57f444f7d64805ed85c363f1302736d0 
>   config/findbugs/excludeFilter.xml fe3f4ca5db1484124af14421a3349950dfec8519 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
> c07551e94f9221b5b21c5dc9715e82caa290c2e8 
>   src/main/java/org/apache/aurora/scheduler/cron/quartz/CronModule.java 
> 155d702d68367b247dd066f773c662407f0e3b5b 
>   
> src/test/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJobTest.java 
> 5c64ff2994e200b3453603ac5470e8e152cebc55 
>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
> 1c0a3fa84874d7bc185b78f13d2664cb4d8dd72f 
> 
> Diff: https://reviews.apache.org/r/51763/diff/
> 
> 
> Testing
> ---
> 
> All types of testing including deploying to test and production clusters.
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>



Re: Review Request 51899: Ensure shell health checkers running for tasks running under an isolated fileystem are run within that filesystem.

2016-09-15 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51899/#review149099
---


Ship it!




Ship It!


src/main/python/apache/aurora/executor/common/health_checker.py (line 265)


nit on the name of `isolator`: `isolator` is already a well-defined concept 
within Mesos, and it seems to me that this is not related to that. Maybe 
consider naming this as `wrapped_fn`?


- Zhitao Li


On Sept. 15, 2016, 3:15 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51899/
> ---
> 
> (Updated Sept. 15, 2016, 3:15 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zhitao Li.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Ensure shell health checkers running for tasks running under an isolated 
> fileystem are run within that filesystem.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/common/health_check/shell.py 
> 35750823553406a96282545066f1291c20347ffa 
>   src/main/python/apache/aurora/executor/bin/thermos_executor_main.py 
> 5211f28e4e6c0efd29d7d79058128adb71ec7da8 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/thermos/common/BUILD 
> 879b812b6a262d6e13b64e662999dd436f039748 
>   src/main/python/apache/thermos/common/process_util.py PRE-CREATION 
>   src/main/python/apache/thermos/core/process.py 
> 2134d4ff05861d4eaee9bc7ea4763e76ce63288c 
>   src/test/python/apache/aurora/common/health_check/test_shell.py 
> 011464cbe1df00f2a56d4690176e7c2d0d3fd535 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
> 290627f8bc38d31ae123cfd1cdd36e9291c2de18 
> 
> Diff: https://reviews.apache.org/r/51899/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> e2e tests
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Re: Review Request 51924: Remove --release-threshold option from aurora job restart.

2016-09-15 Thread Maxim Khutornenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51924/#review149097
---


Ship it!




Ship It!

- Maxim Khutornenko


On Sept. 15, 2016, 6:13 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51924/
> ---
> 
> (Updated Sept. 15, 2016, 6:13 p.m.)
> 
> 
> Review request for Aurora and Maxim Khutornenko.
> 
> 
> Bugs: AURORA-1681
> https://issues.apache.org/jira/browse/AURORA-1681
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Remove --release-threshold option from aurora job restart.
> 
> 
> Diffs
> -
> 
>   RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
>   src/main/python/apache/aurora/client/cli/jobs.py 
> 7b4c2692334acfddb53a52a602a5f07e94b4bd86 
> 
> Diff: https://reviews.apache.org/r/51924/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Review Request 51924: Remove --release-threshold option from aurora job restart.

2016-09-15 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51924/
---

Review request for Aurora and Maxim Khutornenko.


Bugs: AURORA-1681
https://issues.apache.org/jira/browse/AURORA-1681


Repository: aurora


Description
---

Remove --release-threshold option from aurora job restart.


Diffs
-

  RELEASE-NOTES.md ad2c68a6defe07c94480d7dee5b1496b50dc34e5 
  src/main/python/apache/aurora/client/cli/jobs.py 
7b4c2692334acfddb53a52a602a5f07e94b4bd86 

Diff: https://reviews.apache.org/r/51924/diff/


Testing
---


Thanks,

Joshua Cohen



Re: Review Request 51899: Ensure shell health checkers running for tasks running under an isolated fileystem are run within that filesystem.

2016-09-15 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51899/#review149066
---


Ship it!




Patch LGTM.

In general, I don't like the trend that Thermos is growing in complexity with 
multiple different places worrying about setuid, fs isolation, etc.  We should 
have an eye on this so that we don't get slowed down by too much complexity and 
bugs in the future.

- Stephan Erb


On Sept. 15, 2016, 5:15 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51899/
> ---
> 
> (Updated Sept. 15, 2016, 5:15 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zhitao Li.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Ensure shell health checkers running for tasks running under an isolated 
> fileystem are run within that filesystem.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/common/health_check/shell.py 
> 35750823553406a96282545066f1291c20347ffa 
>   src/main/python/apache/aurora/executor/bin/thermos_executor_main.py 
> 5211f28e4e6c0efd29d7d79058128adb71ec7da8 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/thermos/common/BUILD 
> 879b812b6a262d6e13b64e662999dd436f039748 
>   src/main/python/apache/thermos/common/process_util.py PRE-CREATION 
>   src/main/python/apache/thermos/core/process.py 
> 2134d4ff05861d4eaee9bc7ea4763e76ce63288c 
>   src/test/python/apache/aurora/common/health_check/test_shell.py 
> 011464cbe1df00f2a56d4690176e7c2d0d3fd535 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
> 290627f8bc38d31ae123cfd1cdd36e9291c2de18 
> 
> Diff: https://reviews.apache.org/r/51899/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> e2e tests
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Re: Review Request 51899: Ensure shell health checkers running for tasks running under an isolated fileystem are run within that filesystem.

2016-09-15 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51899/#review149065
---


Ship it!




Master (5069f93) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 15, 2016, 3:15 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51899/
> ---
> 
> (Updated Sept. 15, 2016, 3:15 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zhitao Li.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Ensure shell health checkers running for tasks running under an isolated 
> fileystem are run within that filesystem.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/common/health_check/shell.py 
> 35750823553406a96282545066f1291c20347ffa 
>   src/main/python/apache/aurora/executor/bin/thermos_executor_main.py 
> 5211f28e4e6c0efd29d7d79058128adb71ec7da8 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/thermos/common/BUILD 
> 879b812b6a262d6e13b64e662999dd436f039748 
>   src/main/python/apache/thermos/common/process_util.py PRE-CREATION 
>   src/main/python/apache/thermos/core/process.py 
> 2134d4ff05861d4eaee9bc7ea4763e76ce63288c 
>   src/test/python/apache/aurora/common/health_check/test_shell.py 
> 011464cbe1df00f2a56d4690176e7c2d0d3fd535 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
> 290627f8bc38d31ae123cfd1cdd36e9291c2de18 
> 
> Diff: https://reviews.apache.org/r/51899/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> e2e tests
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Re: Review Request 51899: Ensure shell health checkers running for tasks running under an isolated fileystem are run within that filesystem.

2016-09-15 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51899/
---

(Updated Sept. 15, 2016, 3:15 p.m.)


Review request for Aurora, Stephan Erb and Zhitao Li.


Changes
---

Review feedback.


Repository: aurora


Description
---

Ensure shell health checkers running for tasks running under an isolated 
fileystem are run within that filesystem.


Diffs (updated)
-

  src/main/python/apache/aurora/common/health_check/shell.py 
35750823553406a96282545066f1291c20347ffa 
  src/main/python/apache/aurora/executor/bin/thermos_executor_main.py 
5211f28e4e6c0efd29d7d79058128adb71ec7da8 
  src/main/python/apache/aurora/executor/common/health_checker.py 
5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
  src/main/python/apache/thermos/common/BUILD 
879b812b6a262d6e13b64e662999dd436f039748 
  src/main/python/apache/thermos/common/process_util.py PRE-CREATION 
  src/main/python/apache/thermos/core/process.py 
2134d4ff05861d4eaee9bc7ea4763e76ce63288c 
  src/test/python/apache/aurora/common/health_check/test_shell.py 
011464cbe1df00f2a56d4690176e7c2d0d3fd535 
  src/test/python/apache/aurora/executor/common/test_health_checker.py 
bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
  src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
290627f8bc38d31ae123cfd1cdd36e9291c2de18 

Diff: https://reviews.apache.org/r/51899/diff/


Testing
---

./build-support/jenkins/build.sh
e2e tests


Thanks,

Joshua Cohen



Re: Review Request 51899: Ensure shell health checkers running for tasks running under an isolated fileystem are run within that filesystem.

2016-09-15 Thread Joshua Cohen


> On Sept. 15, 2016, 1:23 p.m., Stephan Erb wrote:
> > src/main/python/apache/aurora/common/health_check/shell.py, line 73
> > 
> >
> > You pass in `_cmd` and `_isolator_fn` into the `ShellHealthCheck`. Have 
> > you considered just passing the result of `isolator_fn(cmd)` into the 
> > `ShellHealthCheck`? Or is there a reason that this can only be done in 
> > `__call__`?

I need the original command available so we don't leak the `mesos-containerizer 
launch ...` wrapper in the event of a health check failure. I originally just 
passed both the `cmd` and the `wrapped_cmd` but it felt like a very strange 
interface. "Create a `ShellHealthCheck` to run this `cmd` unless I specified 
`wrapped_cmd` then run that, but still use `cmd` when reporting errors."

If you feel strongly I can change it back to that. That said, you're right that 
we don't need to execute `isolator_fn` in `__call__`. I updated it a bit to 
clean that usage up.
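
A minimal sketch of the interface discussed above, not the actual Aurora implementation: the class name `ShellHealthCheckSketch` and the `wrapper_fn` parameter are hypothetical stand-ins for `ShellHealthCheck` and the isolator function. It assumes the wrapper is applied once at construction time, while failure messages keep referring to the original `cmd` so the wrapper is not leaked to the user.

```python
import subprocess


class ShellHealthCheckSketch(object):
    """Hypothetical sketch: run a wrapped command, report the original one."""

    def __init__(self, cmd, wrapper_fn=None):
        # Keep the user-supplied command for error reporting.
        self._original_cmd = cmd
        # Apply the (hypothetical) wrapper once, not on every call.
        self._cmd = wrapper_fn(cmd) if wrapper_fn is not None else cmd

    def __call__(self):
        try:
            subprocess.check_call(self._cmd, shell=True)
            return True, None
        except subprocess.CalledProcessError as e:
            # Only the original command appears in the failure reason.
            reason = 'Health check `%s` failed with exit code %d' % (
                self._original_cmd, e.returncode)
            return False, reason
```

With this shape the caller passes one command plus an optional wrapper, instead of two separate command strings whose roles have to be explained.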


> On Sept. 15, 2016, 1:23 p.m., Stephan Erb wrote:
> > src/main/python/apache/thermos/common/process_util.py, lines 24-26
> > 
> >
> > Sorry for bringing up that old thing again :-)
> > 
> > I believe this is only needed because we set 'shell:true' below. Have 
> > you tested what happens if we set 'shell:false'?
> > 
> > When looking at the `ps tree` one sees that we run `sh -c bash -c ''` 
> > so that first sh is completely useless and we could probably eliminate it 
> > by setting shell to false.

I think even if we set `shell: false` we'd still need to wrap it in a `bash -c` 
invocation (because we've always been clear that Process cmdlines are 
explicitly *bash* command lines), and it's the bash wrapper that causes us to 
need to escape the quotes, not the `sh ` wrapper from mesos-containerizer.

I tried a few combinations of `shell: false`; since we're just using it to 
launch a shell anyway, `shell: true` *shouldn't* be necessary. However, I wasn't 
able to get it to successfully launch a command. It might be possible, but I'm 
not sure it's worth the effort to suss out the right incantation of splitting 
the command to get it to work. I've added a TODO to investigate if/when time 
allows.
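
For reference, a generic Python 3 `subprocess` sketch of the two launch styles being compared. It deliberately leaves out the mesos-containerizer wrapper, which is where the attempts described above ran into trouble, so it only illustrates the extra `sh -c` layer and is not a fix. The helper names `run_bash_direct` and `run_bash_via_shell` are hypothetical.

```python
import shlex
import subprocess


def run_bash_direct(cmdline):
    # shell=False (the default): bash is exec'd directly and cmdline is
    # passed as a single argv entry, so no extra quoting layer is needed.
    return subprocess.check_output(['bash', '-c', cmdline])


def run_bash_via_shell(cmdline):
    # shell=True: /bin/sh parses the string first, so cmdline has to be
    # quoted to survive that outer shell (the `sh -c bash -c ...` chain).
    return subprocess.check_output('bash -c ' + shlex.quote(cmdline),
                                   shell=True)
```

Both helpers return the same output for a simple command; the difference is the additional `sh` process and the quoting it requires.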


- Joshua


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51899/#review149056
---


On Sept. 14, 2016, 8:49 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51899/
> ---
> 
> (Updated Sept. 14, 2016, 8:49 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zhitao Li.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Ensure shell health checkers running for tasks running under an isolated 
> fileystem are run within that filesystem.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/common/health_check/shell.py 
> 35750823553406a96282545066f1291c20347ffa 
>   src/main/python/apache/aurora/executor/bin/thermos_executor_main.py 
> 5211f28e4e6c0efd29d7d79058128adb71ec7da8 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/thermos/common/BUILD 
> 879b812b6a262d6e13b64e662999dd436f039748 
>   src/main/python/apache/thermos/common/process_util.py PRE-CREATION 
>   src/main/python/apache/thermos/core/process.py 
> 2134d4ff05861d4eaee9bc7ea4763e76ce63288c 
>   src/test/python/apache/aurora/common/health_check/test_shell.py 
> 011464cbe1df00f2a56d4690176e7c2d0d3fd535 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
> 290627f8bc38d31ae123cfd1cdd36e9291c2de18 
> 
> Diff: https://reviews.apache.org/r/51899/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> e2e tests
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Re: Review Request 51899: Ensure shell health checkers running for tasks running under an isolated fileystem are run within that filesystem.

2016-09-15 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51899/#review149056
---




src/main/python/apache/aurora/common/health_check/shell.py (line 73)


You pass in `_cmd` and `_isolator_fn` into the `ShellHealthCheck`. Have you 
considered just passing the result of `isolator_fn(cmd)` into the 
`ShellHealthCheck`? Or is there a reason that this can only be done in 
`__call__`?



src/main/python/apache/thermos/common/process_util.py (lines 24 - 26)


Sorry for bringing up that old thing again :-)

I believe this is only needed because we set 'shell:true' below. Have you 
tested what happens if we set 'shell:false'?

When looking at the `ps tree` one sees that we run `sh -c bash -c ''` so 
that first sh is completely useless and we could probably eliminate it by 
setting shell to false.



src/main/python/apache/thermos/common/process_util.py (line 34)


That is the line I was talking about.


- Stephan Erb


On Sept. 14, 2016, 10:49 p.m., Joshua Cohen wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51899/
> ---
> 
> (Updated Sept. 14, 2016, 10:49 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zhitao Li.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Ensure shell health checkers running for tasks running under an isolated 
> fileystem are run within that filesystem.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/common/health_check/shell.py 
> 35750823553406a96282545066f1291c20347ffa 
>   src/main/python/apache/aurora/executor/bin/thermos_executor_main.py 
> 5211f28e4e6c0efd29d7d79058128adb71ec7da8 
>   src/main/python/apache/aurora/executor/common/health_checker.py 
> 5fc845eceac6f0c048d7489fdc4c672b0c609ea0 
>   src/main/python/apache/thermos/common/BUILD 
> 879b812b6a262d6e13b64e662999dd436f039748 
>   src/main/python/apache/thermos/common/process_util.py PRE-CREATION 
>   src/main/python/apache/thermos/core/process.py 
> 2134d4ff05861d4eaee9bc7ea4763e76ce63288c 
>   src/test/python/apache/aurora/common/health_check/test_shell.py 
> 011464cbe1df00f2a56d4690176e7c2d0d3fd535 
>   src/test/python/apache/aurora/executor/common/test_health_checker.py 
> bb6ea69dd94298c5b8cf4d5f06d06eea7790d66e 
>   src/test/sh/org/apache/aurora/e2e/http/http_example.aurora 
> 290627f8bc38d31ae123cfd1cdd36e9291c2de18 
> 
> Diff: https://reviews.apache.org/r/51899/diff/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> e2e tests
> 
> 
> Thanks,
> 
> Joshua Cohen
> 
>



Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.

2016-09-15 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51759/#review149044
---


Ship it!




Overall lgtm. Agree w/ Zameer that we should ship all three of these tickets 
together though.


src/main/java/org/apache/aurora/scheduler/BatchWorker.java (lines 166 - 167)


super nitpicky: mind swapping the order of these args to keep in line with 
`execute(Work work)`?


- Joshua Cohen


On Sept. 14, 2016, 10:41 p.m., Maxim Khutornenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51759/
> ---
> 
> (Updated Sept. 14, 2016, 10:41 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This is the first (out of 3) patches intending to reduce storage write lock 
> contention and as such improve overall system write throughput. It introduces 
> the `BatchWorker` and migrates the majority of storage writes due to task 
> status change events to use `TaskEventBatchWorker`.
> 
> #Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` 
> transaction isolation level in SQL terms. This means all writes require 
> exclusive access to the storage and no two transactions can happen in 
> parallel [1]. While it certainly simplifies our implementation, it creates a 
> single hotspot where multiple threads are competing for the storage write 
> access. This type of contention only worsens as the cluster size grows, more 
> tasks are scheduled, more status updates are processed, more subscribers are 
> listening to status updates, etc. Eventually, the scheduler throughput 
> (and especially task scheduling) becomes degraded to the extent that certain 
> operations wait much longer (4x and more) for the lock acquisition than it 
> takes to process their payload when inside the transaction. Some ops (like 
> event processing) are generally tolerant of these types of delays. Others - 
> not as much. The task scheduling suffers the most as backing up the 
> scheduling queue directly affects the Median Time To Assigned (MTTA).
> 
> #Remediation
> Given the above, it's natural to assume that reducing the number of write 
> transactions should help reduce the lock contention. This patch introduces 
> a generic `BatchWorker` service that delivers a "best effort" batching 
> approach by redirecting multiple individual write requests into a single FIFO 
> queue served non-stop by a single dedicated thread. Every batch shares a 
> single write transaction thus reducing the number of potential write lock 
> requests. To minimize wait-in-queue time, items are dispatched immediately 
> and the max number of items is bounded. There are a few `BatchWorker` 
> instances specialized on particular workload types: task event processing, 
> cron scheduling and task scheduling. Every instance can be tuned 
> independently (max batch size) and provides specialized metrics helping to 
> monitor each workload type perf.
> 
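
The batching idea described above can be sketched roughly as follows. This is a hypothetical Python illustration, not the Java `BatchWorker` from the patch: the names `run_in_transaction` and `max_batch_size` and the use of `Future` for completion are assumptions made for the example. A single dedicated thread blocks for the first queued item, drains whatever else is already waiting up to the bound, and runs the whole batch inside one write transaction.

```python
import queue
import threading
from concurrent.futures import Future


class BatchWorker:
    """Hypothetical sketch of best-effort write batching."""

    def __init__(self, run_in_transaction, max_batch_size=10):
        # run_in_transaction(body) is assumed to acquire the write lock,
        # call body(), and release the lock.
        self._run_in_transaction = run_in_transaction
        self._max_batch_size = max_batch_size
        self._queue = queue.Queue()  # FIFO queue of (work, future) pairs
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def execute(self, work):
        # Enqueue a zero-argument callable; the future completes once the
        # batch containing it has been processed.
        future = Future()
        self._queue.put((work, future))
        return future

    def _next_batch(self):
        # Block for the first item, then take only what is already queued,
        # so items are dispatched immediately rather than on a timer.
        batch = [self._queue.get()]
        while len(batch) < self._max_batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _process(self, batch):
        for work, future in batch:
            try:
                future.set_result(work())
            except Exception as e:  # report the failure to the caller
                future.set_exception(e)

    def _loop(self):
        while True:
            batch = self._next_batch()
            # One transaction (one lock acquisition) per batch.
            self._run_in_transaction(lambda: self._process(batch))
```

Under load, many `execute` calls share one lock acquisition; when the queue is nearly empty, each item still runs almost immediately in a batch of one.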
> #Results
> The proposed approach has been heavily tested in production and delivered the 
> best results. Lock contention latencies dropped by 2x to 5x depending on 
> the cluster load. A number of other approaches were tried but discarded as 
> not performing well or even performing much worse than the current 
> master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached 
> OR a timeout expires
> - Various combinations of the above - some `BatchWorkers` are using 
> clock-driven execution while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call 
> sites are notified of item completion via a `BatchWorkCompleted` event
> 
> Happy to provide more details on the above if interested.
> 
> #Upcoming
> The introduction of the `BatchWorker` by itself was not enough to 
> substantially improve the MTTA. It, however, paves the way for the next phase 
> of scheduling perf improvement - taking more than 1 task from a given 
> `TaskGroup` in a single scheduling round (coming soon). That improvement 
> wouldn't deliver without decreasing the lock contention first. 
> 
> Note: it wasn't easy to have a clean diff split, so some functionality in 
> `BatchWorker` (e.g.: `executeWithReplay`) appears to be unused in the current 
> patch, but its purpose will become obvious in part 2 (coming out shortly).  
> 
> [1] - 
> 

Re: Review Request 51893: Allow cookie based authentication

2016-09-15 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51893/#review149031
---



Master (5069f93) is red with this patch.
  ./build-support/jenkins/build.sh

  Running setup.py bdist_wheel for twitter.common.options: finished with status 
'done'
  Stored in directory: 
/home/jenkins/jenkins-slave/workspace/AuroraBot/.home/.cache/pip/wheels/60/c2/40/54b323809df9598cc125f02527f93ff743cd9bd979f4a1737d
Successfully built pantsbuild.pants ansicolors setproctitle 
twitter.common.collections pathspec twitter.common.dirutil pystache scandir 
psutil pywatchman Markdown Pygments docutils twitter.common.confluence coverage 
pytest pytest-cov lmdb twitter.common.lang twitter.common.log cov-core 
twitter.common.options
Installing collected packages: ansicolors, setproctitle, twitter.common.lang, 
twitter.common.collections, six, pathspec, twitter.common.dirutil, requests, 
pystache, scandir, psutil, pywatchman, futures, setuptools, pex, Markdown, 
Pygments, docutils, twitter.common.options, twitter.common.log, 
twitter.common.confluence, monotonic, fasteners, coverage, py, pytest, 
cov-core, pytest-cov, lmdb, pantsbuild.pants
  Found existing installation: setuptools 21.2.1
Uninstalling setuptools-21.2.1:
  Successfully uninstalled setuptools-21.2.1
Successfully installed Markdown-2.1.1 Pygments-1.4 ansicolors-1.0.2 
cov-core-1.15.0 coverage-3.7.1 docutils-0.12 fasteners-0.14.1 futures-3.0.5 
lmdb-0.89 monotonic-1.2 pantsbuild.pants-1.1.0rc7 pathspec-0.3.4 pex-1.1.10 
psutil-4.3.0 py-1.4.31 pystache-0.5.3 pytest-2.6.4 pytest-cov-1.8.1 
pywatchman-1.3.0 requests-2.5.3 scandir-1.2 setproctitle-1.1.10 
setuptools-5.4.1 six-1.10.0 twitter.common.collections-0.3.7 
twitter.common.confluence-0.3.7 twitter.common.dirutil-0.3.7 
twitter.common.lang-0.3.7 twitter.common.log-0.3.7 twitter.common.options-0.3.7

07:27:04 00:00 [main]
   (To run a reporting server: ./pants server)
07:27:04 00:00   [setup]
07:27:04 00:00 [parse]
   Executing tasks in goals: compile
07:27:05 00:01   [compile]
07:27:05 00:01 [compile-prep-command]
07:27:05 00:01 [compile]
07:27:05 00:01 [python-eval]
07:27:05 00:01 [pythonstyle]
07:27:05 00:01   [cache]  
   No cached artifacts for 42 targets.
   Invalidated 42 targets.
F401:ERROR   src/main/python/apache/aurora/client/api/scheduler_client.py:017 
'sys' imported but unused
 |import sys

T001:ERROR   src/main/python/apache/aurora/client/api/scheduler_client.py:054 
Class globals must be UPPER_SNAKE_CASED
 |  cookie_jar= String  #noqa

E221:ERROR   
PythonFile(src/main/python/apache/aurora/client/api/scheduler_client.py):054 
multiple spaces before operator
 |  cookie_jar= String  #noqa

E262:ERROR   
PythonFile(src/main/python/apache/aurora/client/api/scheduler_client.py):054 
inline comment should start with '# '
 |  cookie_jar= String  #noqa


FAILURE: 4 Python Style issues found


07:27:24 00:20   [complete]
   FAILURE


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Sept. 15, 2016, 7:17 a.m., Giulio Eulisse wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51893/
> ---
> 
> (Updated Sept. 15, 2016, 7:17 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen and WarnerSM WarnerSM.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This allows aurora client to connect to servers which are behind a
> frontend which expects some sort of cookie to authenticate and authorize
> users.
> 
> A cookie_jar option can be specified in the `~/.aurora/clusters.json`
> to specify a file where the cookie jar is located. Such a cookie jar,
> in MozillaCookieJar format, will be used to create the session and
> therefore all the subsequent requests will use it.
> 
> 
> Diffs
> -
> 
>   src/main/python/apache/aurora/client/api/scheduler_client.py 
> cbdb50ae409b70a35a03405f969d02a6145c9c53 
> 
> Diff: https://reviews.apache.org/r/51893/diff/
> 
> 
> Testing
> ---
> 
> $ cat ~/aurora/clusters.json
> [
> {
>   "name": "build",
>   "scheduler_uri": "https://aliaurora.cern.ch",
>   "auth_mechanism": "UNAUTHENTICATED",
>   "cookie_jar": "~/.aurora-token"
> }
> ]
> $ dist/aurora.pex quota get build/root
> 
> 
> Thanks,
> 
> Giulio Eulisse
> 
>



Re: Review Request 51893: Allow cookie based authentication

2016-09-15 Thread Giulio Eulisse

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51893/
---

(Updated Sept. 15, 2016, 7:17 a.m.)


Review request for Aurora, Joshua Cohen and WarnerSM WarnerSM.


Changes
---

@ReviewBot retry


Repository: aurora


Description
---

This allows aurora client to connect to servers which are behind a
frontend which expects some sort of cookie to authenticate and authorize
users.

A cookie_jar option can be specified in the `~/.aurora/clusters.json`
to specify a file where the cookie jar is located. Such a cookie jar,
in MozillaCookieJar format, will be used to create the session and
therefore all the subsequent requests will use it.
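
As a rough illustration of the mechanism, here is a minimal sketch using only the Python 3 standard library. It is not the code from the diff: the function name `opener_with_cookie_jar` is hypothetical, and the sketch uses `urllib` rather than whatever session object the client builds. It loads a Netscape/Mozilla-format cookie file and attaches it to an opener so that every request made through that opener sends the stored cookies.

```python
import os
from http.cookiejar import MozillaCookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def opener_with_cookie_jar(cookie_jar_path):
    # Expand "~" so a value such as "~/.aurora-token" resolves to a real path.
    jar = MozillaCookieJar(os.path.expanduser(cookie_jar_path))
    # ignore_discard=True keeps session cookies that carry no expiry time.
    jar.load(ignore_discard=True)
    # Requests made via this opener send matching cookies from the jar.
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar
```

The file must start with the `# Netscape HTTP Cookie File` header line, otherwise `load` raises `LoadError`.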


Diffs (updated)
-

  src/main/python/apache/aurora/client/api/scheduler_client.py 
cbdb50ae409b70a35a03405f969d02a6145c9c53 

Diff: https://reviews.apache.org/r/51893/diff/


Testing
---

$ cat ~/aurora/clusters.json
[
{
  "name": "build",
  "scheduler_uri": "https://aliaurora.cern.ch",
  "auth_mechanism": "UNAUTHENTICATED",
  "cookie_jar": "~/.aurora-token"
}
]
$ dist/aurora.pex quota get build/root


Thanks,

Giulio Eulisse