Re: Review Request 65338: Fix error handling logic for launch failures

2018-01-25 Thread Jordan Ly

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196275
---


Ship it!




Ship It!

- Jordan Ly


On Jan. 25, 2018, 6:21 p.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> ---
> 
> (Updated Jan. 25, 2018, 6:21 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1965
> https://issues.apache.org/jira/browse/AURORA-1965
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. 
> Before we attempt to launch a task, we  move the task to ASSIGNED state. 
> However, the code to deal with launch failures expects the task to be in 
> PENDING state. So the ASSIGNED -> LOST state change fails, and instead we 
> rely on the transient task timeout for correctness. This means errors that 
> can be recovered from in seconds instead take at least five minutes (by 
> default).
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 
> 916908bbf635a261c01777cd3a357ca457dd9726 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java
>  533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>



Re: Review Request 65338: Fix error handling logic for launch failures

2018-01-25 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/
---

(Updated Jan. 25, 2018, 6:21 p.m.)


Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.


Bugs: AURORA-1965
https://issues.apache.org/jira/browse/AURORA-1965


Repository: aurora


Description
---

Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. 
Before we attempt to launch a task, we  move the task to ASSIGNED state. 
However, the code to deal with launch failures expects the task to be in 
PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely 
on the transient task timeout for correctness. This means errors that can be 
recovered from in seconds instead take at least five minutes (by default).


Diffs
-

  src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 
916908bbf635a261c01777cd3a357ca457dd9726 
  
src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 
533bb44953163e2148fa18c394a4338938dae205 


Diff: https://reviews.apache.org/r/65338/diff/1/


Testing
---

./gradlew test

Also tested in Vagrant.


Thanks,

David McLaughlin



Re: Review Request 65338: Fix error handling logic for launch failures

2018-01-25 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196273
---


Ship it!




Please link to - https://issues.apache.org/jira/browse/AURORA-1965 and resolve.

- Santhosh Kumar Shanmugham


On Jan. 25, 2018, 12:43 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> ---
> 
> (Updated Jan. 25, 2018, 12:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. 
> Before we attempt to launch a task, we  move the task to ASSIGNED state. 
> However, the code to deal with launch failures expects the task to be in 
> PENDING state. So the ASSIGNED -> LOST state change fails, and instead we 
> rely on the transient task timeout for correctness. This means errors that 
> can be recovered from in seconds instead take at least five minutes (by 
> default).
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 
> 916908bbf635a261c01777cd3a357ca457dd9726 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java
>  533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>



Re: Review Request 65338: Fix error handling logic for launch failures

2018-01-25 Thread David McLaughlin


> On Jan. 25, 2018, 2 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java
> > Line 135 (original), 135 (patched)
> > 
> >
> > To prevent future mistakes of the same kind, should we remove the 
> > `casState` state here and just transition to `LOST` unconditionally?
> > 
> > I cannot think of a scenario right now where launching would fail, but 
> > we would still like the task to live on.

The CAS is to avoid race conditions (e.g. executing this block twice and trying 
to move from LOST to LOST). So I think it's a good idea to keep it?


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196248
---


On Jan. 25, 2018, 8:43 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> ---
> 
> (Updated Jan. 25, 2018, 8:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. 
> Before we attempt to launch a task, we  move the task to ASSIGNED state. 
> However, the code to deal with launch failures expects the task to be in 
> PENDING state. So the ASSIGNED -> LOST state change fails, and instead we 
> rely on the transient task timeout for correctness. This means errors that 
> can be recovered from in seconds instead take at least five minutes (by 
> default).
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 
> 916908bbf635a261c01777cd3a357ca457dd9726 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java
>  533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>



Re: Review Request 65338: Fix error handling logic for launch failures

2018-01-25 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196248
---




src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java
Line 135 (original), 135 (patched)


To prevent future mistakes of the same kind, should we remove the 
`casState` state here and just transition to `LOST` unconditionally?

I cannot think of a scenario right now where launching would fail, but we 
would still like the task to live on.


- Stephan Erb


On Jan. 25, 2018, 9:43 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> ---
> 
> (Updated Jan. 25, 2018, 9:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. 
> Before we attempt to launch a task, we  move the task to ASSIGNED state. 
> However, the code to deal with launch failures expects the task to be in 
> PENDING state. So the ASSIGNED -> LOST state change fails, and instead we 
> rely on the transient task timeout for correctness. This means errors that 
> can be recovered from in seconds instead take at least five minutes (by 
> default).
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 
> 916908bbf635a261c01777cd3a357ca457dd9726 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java
>  533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>



Re: Review Request 65338: Fix error handling logic for launch failures

2018-01-25 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/#review196226
---


Ship it!




Master (dbe7137) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Jan. 25, 2018, 8:43 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65338/
> ---
> 
> (Updated Jan. 25, 2018, 8:43 a.m.)
> 
> 
> Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. 
> Before we attempt to launch a task, we  move the task to ASSIGNED state. 
> However, the code to deal with launch failures expects the task to be in 
> PENDING state. So the ASSIGNED -> LOST state change fails, and instead we 
> rely on the transient task timeout for correctness. This means errors that 
> can be recovered from in seconds instead take at least five minutes (by 
> default).
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 
> 916908bbf635a261c01777cd3a357ca457dd9726 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java
>  533bb44953163e2148fa18c394a4338938dae205 
> 
> 
> Diff: https://reviews.apache.org/r/65338/diff/1/
> 
> 
> Testing
> ---
> 
> ./gradlew test
> 
> Also tested in Vagrant.
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>



Review Request 65338: Fix error handling logic for launch failures

2018-01-25 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65338/
---

Review request for Aurora, Jordan Ly and Santhosh Kumar Shanmugham.


Repository: aurora


Description
---

Discovered while debugging https://issues.apache.org/jira/browse/AURORA-1966. 
Before we attempt to launch a task, we  move the task to ASSIGNED state. 
However, the code to deal with launch failures expects the task to be in 
PENDING state. So the ASSIGNED -> LOST state change fails, and instead we rely 
on the transient task timeout for correctness. This means errors that can be 
recovered from in seconds instead take at least five minutes (by default).


Diffs
-

  src/main/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImpl.java 
916908bbf635a261c01777cd3a357ca457dd9726 
  
src/test/java/org/apache/aurora/scheduler/scheduling/TaskAssignerImplTest.java 
533bb44953163e2148fa18c394a4338938dae205 


Diff: https://reviews.apache.org/r/65338/diff/1/


Testing
---

./gradlew test

Also tested in Vagrant.


Thanks,

David McLaughlin