Re: Review Request 44926: Auto-retry on failure during RU/EU

2016-03-21 Thread Alejandro Fernandez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
---

(Updated March 21, 2016, 8:59 p.m.)


Review request for Ambari, Jonathan Hurley and Nate Cole.


Bugs: AMBARI-15446
https://issues.apache.org/jira/browse/AMBARI-15446


Repository: ambari


Description
---

When a failure occurs during RU/EU (Rolling Upgrade/Express Upgrade) and the
task transitions to HOLDING_FAILED or HOLDING_TIMEDOUT, we want Ambari to
automatically retry for up to x minutes.
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have one new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increases the attempt_count whenever a task is retried,
but it requires resetting the start_time to -1. Because of this, we cannot rely
on the start_time property to know when to time out after several retries.
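
As a rough illustration of the new parameter, here is a minimal sketch of how
the value could be read. The class and method names are hypothetical (this is
not the actual Configuration.java change), and treating an absent or invalid
value as "auto-retry disabled" is an assumption drawn from the note that the
key is not present by default.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not Ambari's Configuration class. */
final class RetryTimeoutConfig {
  /** Property name taken from the review description. */
  static final String KEY = "stack-upgrade.max_retry_timeout_mins";

  /**
   * Returns the retry window in minutes. Assumption: a missing, negative,
   * or non-numeric value means auto-retry is disabled, reported as 0.
   */
  static int maxRetryTimeoutMins(Properties props) {
    String raw = props.getProperty(KEY);
    if (raw == null) {
      return 0; // key absent, which the description says is the default
    }
    try {
      return Math.max(Integer.parseInt(raw.trim()), 0);
    } catch (NumberFormatException e) {
      return 0; // unparseable value, treated the same as absent
    }
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(maxRetryTimeoutMins(props)); // key not set
    props.setProperty(KEY, "15");
    System.out.println(maxRetryTimeoutMins(props)); // key set to 15
  }
}
```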

For the implementation, we will add another thread to Ambari that monitors
failed tasks only during an active RU/EU and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
blocking, so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new column
called original_start_time to the host_role_command table.
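
The check that the monitoring thread performs can be sketched roughly as
follows. This is an illustration under stated assumptions, not the code in
RetryUpgradeActionService.java: the Task class, the method names, and the use
of epoch milliseconds are all hypothetical. Only the status names, the reset
of start_time to -1, and the idea of measuring elapsed time from
original_start_time come from the description above.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the retry decision; not the actual Ambari service. */
final class RetryDecision {
  /** Subset of task states; the HOLDING_* and PENDING names are from the description. */
  enum Status { PENDING, IN_PROGRESS, COMPLETED, HOLDING_FAILED, HOLDING_TIMEDOUT }

  /** Minimal stand-in for a row of the host_role_command table. */
  static final class Task {
    Status status;
    final long originalStartTime; // first attempt, epoch millis (assumed unit); -1 if never started
    long startTime;               // reset to -1 whenever the task is retried

    Task(Status status, long originalStartTime, long startTime) {
      this.status = status;
      this.originalStartTime = originalStartTime;
      this.startTime = startTime;
    }
  }

  /** True if the task is holding and still inside the retry window. */
  static boolean shouldRetry(Task task, long nowMillis, int maxRetryTimeoutMins) {
    if (maxRetryTimeoutMins <= 0) {
      return false; // assumption: a non-positive window disables auto-retry
    }
    if (task.status != Status.HOLDING_FAILED && task.status != Status.HOLDING_TIMEDOUT) {
      return false; // only held tasks are candidates
    }
    if (task.originalStartTime < 0) {
      return false; // never started, so there is no window to measure
    }
    long elapsed = nowMillis - task.originalStartTime;
    return elapsed < TimeUnit.MINUTES.toMillis(maxRetryTimeoutMins);
  }

  /** Moves an eligible task back to PENDING so a scheduler could pick it up again. */
  static boolean retryIfEligible(Task task, long nowMillis, int maxRetryTimeoutMins) {
    if (!shouldRetry(task, nowMillis, maxRetryTimeoutMins)) {
      return false;
    }
    task.status = Status.PENDING;
    task.startTime = -1L; // originalStartTime is left untouched on purpose
    return true;
  }
}
```

Because originalStartTime is never reset, the elapsed time keeps growing
across retries, which is what lets the window eventually expire even though
start_time goes back to -1 each time.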

For the agents, we need to ensure that they always write out a response. On
its first heartbeat, an agent should send the status of its last command so
that the server knows it failed and Ambari can retry.


Diffs
---

  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
  ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
  ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
  ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
  ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
  ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
  ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
  ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
  ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/44926/diff/


Testing (updated)
---

Verified on a live cluster.
Unit tests passed:

mvn clean package test

[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 01:11 h
[INFO] Finished at: 2016-03-21T13:15:21-07:00
[INFO] Final Memory: 139M/4054M
[INFO] 


Thanks,

Alejandro Fernandez



Re: Review Request 44926: Auto-retry on failure during RU/EU

2016-03-21 Thread Jonathan Hurley

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/#review124629
---


Ship it!

- Jonathan Hurley


On March 21, 2016, 3:03 p.m., Alejandro Fernandez wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44926/
> ---
> 
> (Updated March 21, 2016, 3:03 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley and Nate Cole.
> 
> 
> Bugs: AMBARI-15446
> https://issues.apache.org/jira/browse/AMBARI-15446
> 
> 
> Repository: ambari
> 
> 
> Description
> ---
> 
> When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
> or HOLDING_TIMEDOUT, we want Ambari to automatically retry for up to x
> minutes.
> This is useful when a host goes down while Ambari is running a task on it.
> ambari.properties will have one new parameter, e.g.,
> stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increases the attempt_count whenever a task is
> retried, but it requires resetting the start_time to -1. Because of this, we
> cannot rely on the start_time property to know when to time out after several
> retries.
> 
> For the implementation, we will add another thread to Ambari that monitors
> failed tasks only during an active RU/EU and changes their status back to
> PENDING so that Action Scheduler can reschedule them.
> Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
> blocking, so no other stages are allowed to proceed.
> In order to know when a task was first started, we will add a new column
> called original_start_time to the host_role_command table.
> 
> For the agents, we need to ensure that they always write out a response. On
> its first heartbeat, an agent should send the status of its last command so
> that the server knows it failed and Ambari can retry.
> 
> 
> Diffs
> ---
> 
>   ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
>   ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
>   ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
>   ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
>   ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
>   ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
>   ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
>   ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
>   ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
>   ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
>   ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
>   ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
>   ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
>   ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
>   ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
>   ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
>   ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION
> 
> Diff: https://reviews.apache.org/r/44926/diff/
> 
> 
> Testing
> ---
> 
> Verified on a live cluster.
> New unit test passed, waiting for full set of unit test results.
> 
> 
> Thanks,
> 
> Alejandro Fernandez
> 
>



Re: Review Request 44926: Auto-retry on failure during RU/EU

2016-03-21 Thread Alejandro Fernandez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
---

(Updated March 21, 2016, 7:03 p.m.)


Review request for Ambari, Jonathan Hurley and Nate Cole.


Bugs: AMBARI-15446
https://issues.apache.org/jira/browse/AMBARI-15446


Repository: ambari


Description
---

When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
or HOLDING_TIMEDOUT, we want Ambari to automatically retry for up to x minutes.
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have one new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increases the attempt_count whenever a task is retried,
but it requires resetting the start_time to -1. Because of this, we cannot rely
on the start_time property to know when to time out after several retries.

For the implementation, we will add another thread to Ambari that monitors
failed tasks only during an active RU/EU and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
blocking, so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new column
called original_start_time to the host_role_command table.

For the agents, we need to ensure that they always write out a response. On
its first heartbeat, an agent should send the status of its last command so
that the server knows it failed and Ambari can retry.


Diffs (updated)
---

  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
  ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
  ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
  ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
  ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
  ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
  ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
  ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
  ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/44926/diff/


Testing
---

Verified on a live cluster.
New unit test passed, waiting for full set of unit test results.


Thanks,

Alejandro Fernandez



Re: Review Request 44926: Auto-retry on failure during RU/EU

2016-03-21 Thread Alejandro Fernandez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
---

(Updated March 21, 2016, 6:57 p.m.)


Review request for Ambari, Jonathan Hurley and Nate Cole.


Changes
---

Addressed comments and retested.


Summary (updated)
---

Auto-retry on failure during RU/EU


Bugs: AMBARI-15446
https://issues.apache.org/jira/browse/AMBARI-15446


Repository: ambari


Description
---

When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
or HOLDING_TIMEDOUT, we want Ambari to automatically retry for up to x minutes.
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have one new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increases the attempt_count whenever a task is retried,
but it requires resetting the start_time to -1. Because of this, we cannot rely
on the start_time property to know when to time out after several retries.

For the implementation, we will add another thread to Ambari that monitors
failed tasks only during an active RU/EU and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
blocking, so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new column
called original_start_time to the host_role_command table.

For the agents, we need to ensure that they always write out a response. On
its first heartbeat, an agent should send the status of its last command so
that the server knows it failed and Ambari can retry.


Diffs (updated)
---

  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
  ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
  ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
  ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
  ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
  ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
  ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
  ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
  ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/44926/diff/


Testing (updated)
---

Verified on a live cluster.
New unit test passed, waiting for full set of unit test results.


Thanks,

Alejandro Fernandez