Re: Review Request 44926: [DRAFT] Auto-retry on failure during RU/EU

2016-03-19 Thread Alejandro Fernandez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/#review124117
---




ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java
 (line 61)


Today, startTime is allowed to be changed to -1 after it already had value. 
For this reason, I added a column called originalStartTime that can only be set 
once.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 51)


Switched to guava service :-)



ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java
 (line 1123)


Feature will be turned off by default.



ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java
 (line 643)


Added column on ambari upgrade.


- Alejandro Fernandez


On March 17, 2016, 11:07 p.m., Alejandro Fernandez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44926/
> ---
> 
> (Updated March 17, 2016, 11:07 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Bugs: AMBARI-15446
> https://issues.apache.org/jira/browse/AMBARI-15446
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED 
> or HOLDING_TIMEDOUT, want Ambari to automatically retry up to up to x mins. 
> This is useful when a host goes down as Ambari is running a task on it.
> ambari.properties will have 1 new parameter. E.g,. 
> stack-upgrade.max_retry_timeout_mins=15 (by default, will not be present)
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increases the attempt_count whenever a task is 
> retried, but it requires resetting the start_time to -1. Because of this, we 
> cannot rely on the start_time property to know when to timeout after several 
> retries.
> 
> For the implementation, will add another thread to Ambari that will monitor 
> failed tasks only during active RU/EU and change the status back to PENDING 
> so that Action Scheduler can reschedule it.
> Luckily, any tasks in HOLDING_TIMEDOUT and HOLDING_FAILED states are 
> blocking, so no other stages are allowed to proceed.
> In order to know when a task was first started, will add a new property to 
> host_role_command table called original_start_time.
> 
> For the agents, we need to ensure that they always write out a response. On 
> the first heartbeat, it should send the status of its last command so we know 
> it failed and Ambari can retry.
> 
> 
> Diffs
> -
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java
>  429f573 
>   
> ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java
>  2764b3f 
>   
> ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatHandler.java
>  3a80803 
>   
> ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java
>  a1a686a 
>   
> ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
>  PRE-CREATION 
>   
> ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java
>  9404506 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java
>  f5b1cb4 
>   
> ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java
>  19f0602 
>   
> ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java
>  9eb514a 
>   
> ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java
>  82edbcf 
>   
> ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java
>  a803f73 
>   ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql 9b4810c 
>   ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql cc3d197 
>   ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql 07c786d 
>   ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 
> ab6dc93 
>   ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql 8e91fde 
>   ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 440ca44 
> 
> Diff: https://reviews.apache.org/r/44926/diff/
> 
> 
> Testing
> ---
> 
> Verified on a live cluster.
> 
> TODO: Still need to make more changes to the implementation, add the config, 
> switch to gauva service, add a column, and add unit tests.
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
>



Re: Review Request 44926: [DRAFT] Auto-retry on failure during RU/EU

2016-03-19 Thread Jonathan Hurley

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/#review124202
---




ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java
 (lines 334 - 340)


Documentation.



ambari-server/src/main/java/org/apache/ambari/server/agent/HeartBeatHandler.java
 (line 130)


Get rid of import.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 51)


This class is really scoped for stack upgrades. Yet it's name doesn't 
indicate that nor is it in an upgrade package.

Thoughts on changing the name or moving its package?



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 72)


Comment on why these are skipped (for those of us who don't have upgrade 
blood in our veins)



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 84)


Consistency in name with `m_`



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 89)


Consistency in name with `m_`



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 105)


Expose via Configuration & ambari.properties



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 119)


Log.info that it's not going to run, and why.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 133)


Use {} for log statements, especially debug ones.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 178)


Can we fail fast for things like this?

```
if( null == requestId )
  return
```

Makes multiple if/for/if indents easier to read.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 180)


Doesn't the contract ensure a non-null collection? You can eliminate this 
if-statement in that case.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 207)


Doc.



ambari-server/src/main/java/org/apache/ambari/server/agent/RetryActionMonitor.java
 (line 225)


Doc



ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java
 (lines 317 - 320)


Can you use named parameters here and make this a NamedQuery?


- Jonathan Hurley


On March 17, 2016, 7:07 p.m., Alejandro Fernandez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44926/
> ---
> 
> (Updated March 17, 2016, 7:07 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Bugs: AMBARI-15446
> https://issues.apache.org/jira/browse/AMBARI-15446
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED 
> or HOLDING_TIMEDOUT, want Ambari to automatically retry up to up to x mins. 
> This is useful when a host goes down as Ambari is running a task on it.
> ambari.properties will have 1 new parameter. E.g,. 
> stack-upgrade.max_retry_timeout_mins=15 (by default, will not be present)
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increases the attempt_count whenever a task is 
> retried, but it requires resetting the start_time to -1. Because of this, we 
> cannot rely on the start_time property to know when to timeout after several 
> retries.
> 
> For the implementation, will add another thread to Ambari that will monitor 
> failed tasks only during active RU/EU and change the status back to PENDING 
> so that Action Scheduler can reschedule it.
> Luckily, any tasks in HOLDING_TIMEDOUT and HOLDING_FAILED states are 
> blocking, so no other stages are allowed to proceed.
> In order to know when a task was first started, will add a new property to 
> host_role_command table called original_start_time.
> 
> For the agents, we need to ensure that