Re: Review Request 44926: Auto-retry on failure during RU/EU
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
---

(Updated March 21, 2016, 8:59 p.m.)

Review request for Ambari, Jonathan Hurley and Nate Cole.

Bugs: AMBARI-15446
    https://issues.apache.org/jira/browse/AMBARI-15446

Repository: ambari

Description
---

When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
or HOLDING_TIMEDOUT, we want Ambari to automatically retry for up to x minutes.
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have 1 new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increases the attempt_count whenever a task is
retried, but it requires resetting the start_time to -1. Because of this, we
cannot rely on the start_time property to know when to time out after several
retries.

For the implementation, we will add another thread to Ambari that monitors
failed tasks only during an active RU/EU and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
blocking, so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new property to
the host_role_command table called original_start_time.

For the agents, we need to ensure that they always write out a response. On
the first heartbeat, an agent should send the status of its last command so we
know it failed and Ambari can retry.
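The retry decision described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: the class RetryWindowSketch, its Status enum, and the methods shouldRetry and nextStatus are hypothetical names, and treating a non-positive timeout as "feature disabled" is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the retry-window check; not Ambari's actual classes. */
final class RetryWindowSketch {

    /** Illustrative subset of task states; only the HOLDING_* and PENDING names come from the description. */
    enum Status { PENDING, COMPLETED, HOLDING_FAILED, HOLDING_TIMEDOUT }

    /** The two blocking states that the monitoring thread would look for. */
    private static final Set<Status> RETRYABLE =
            EnumSet.of(Status.HOLDING_FAILED, Status.HOLDING_TIMEDOUT);

    private RetryWindowSketch() {}

    /**
     * Returns true if the task may still be retried: it is in a retryable state
     * and less than maxRetryTimeoutMins have elapsed since it was FIRST started.
     * originalStartTimeMs plays the role of original_start_time; a value <= 0 is
     * treated as "never started" (assumption).
     */
    static boolean shouldRetry(Status status, long originalStartTimeMs,
                               long nowMs, int maxRetryTimeoutMins) {
        if (maxRetryTimeoutMins <= 0) {
            return false; // assumption: an absent or non-positive value disables auto-retry
        }
        if (!RETRYABLE.contains(status) || originalStartTimeMs <= 0) {
            return false;
        }
        long elapsedMs = nowMs - originalStartTimeMs;
        return elapsedMs < TimeUnit.MINUTES.toMillis(maxRetryTimeoutMins);
    }

    /** Status after one monitoring pass: PENDING if retried, otherwise unchanged. */
    static Status nextStatus(Status status, long originalStartTimeMs,
                             long nowMs, int maxRetryTimeoutMins) {
        return shouldRetry(status, originalStartTimeMs, nowMs, maxRetryTimeoutMins)
                ? Status.PENDING
                : status;
    }
}
```

Measuring the window from the first start rather than the most recent one is what keeps the retries bounded even though each retry resets start_time.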
Diffs
---

  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
  ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
  ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
  ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
  ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
  ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
  ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
  ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
  ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/44926/diff/

Testing (updated)
---

Verified on a live cluster.
Unit tests passed, mvn clean package test

[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 01:11 h
[INFO] Finished at: 2016-03-21T13:15:21-07:00
[INFO] Final Memory: 139M/4054M
[INFO]

Thanks,

Alejandro Fernandez
Re: Review Request 44926: Auto-retry on failure during RU/EU
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/#review124629
---

Ship it!

Ship It!

- Jonathan Hurley

On March 21, 2016, 3:03 p.m., Alejandro Fernandez wrote:
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44926/
> ---
>
> (Updated March 21, 2016, 3:03 p.m.)
>
> Review request for Ambari, Jonathan Hurley and Nate Cole.
>
> Bugs: AMBARI-15446
>     https://issues.apache.org/jira/browse/AMBARI-15446
>
> Repository: ambari
>
> Description
> ---
>
> When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
> or HOLDING_TIMEDOUT, we want Ambari to automatically retry for up to x minutes.
> This is useful when a host goes down while Ambari is running a task on it.
> ambari.properties will have 1 new parameter, e.g.,
> stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
> If Ambari Server is restarted, it should be able to recover.
> Today, Action Scheduler increases the attempt_count whenever a task is
> retried, but it requires resetting the start_time to -1. Because of this, we
> cannot rely on the start_time property to know when to time out after several
> retries.
>
> For the implementation, we will add another thread to Ambari that monitors
> failed tasks only during an active RU/EU and changes their status back to
> PENDING so that Action Scheduler can reschedule them.
> Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
> blocking, so no other stages are allowed to proceed.
> In order to know when a task was first started, we will add a new property to
> the host_role_command table called original_start_time.
>
> For the agents, we need to ensure that they always write out a response. On
> the first heartbeat, an agent should send the status of its last command so we
> know it failed and Ambari can retry.
Re: Review Request 44926: Auto-retry on failure during RU/EU
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
---

(Updated March 21, 2016, 7:03 p.m.)

Review request for Ambari, Jonathan Hurley and Nate Cole.

Bugs: AMBARI-15446
    https://issues.apache.org/jira/browse/AMBARI-15446

Repository: ambari

Description
---

When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
or HOLDING_TIMEDOUT, we want Ambari to automatically retry for up to x minutes.
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have 1 new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increases the attempt_count whenever a task is
retried, but it requires resetting the start_time to -1. Because of this, we
cannot rely on the start_time property to know when to time out after several
retries.

For the implementation, we will add another thread to Ambari that monitors
failed tasks only during an active RU/EU and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
blocking, so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new property to
the host_role_command table called original_start_time.

For the agents, we need to ensure that they always write out a response. On
the first heartbeat, an agent should send the status of its last command so we
know it failed and Ambari can retry.
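Reading the new ambari.properties parameter might look something like the sketch below. It uses only java.util.Properties and is not taken from the patch's Configuration.java; the class RetryTimeoutConfigSketch and its methods are hypothetical, and treating a blank, malformed, or non-positive value the same as an absent one is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.OptionalInt;
import java.util.Properties;

/** Hypothetical sketch of reading the retry timeout; not Ambari's Configuration class. */
final class RetryTimeoutConfigSketch {

    /** Property name taken from the review description. */
    static final String KEY = "stack-upgrade.max_retry_timeout_mins";

    private RetryTimeoutConfigSketch() {}

    /**
     * Returns the configured retry window in minutes, or empty when the property
     * is absent (the default). Blank, malformed, and non-positive values are also
     * treated as "disabled" here, which is an assumption of this sketch.
     */
    static OptionalInt maxRetryTimeoutMins(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            int mins = Integer.parseInt(raw.trim());
            return mins > 0 ? OptionalInt.of(mins) : OptionalInt.empty();
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /** Helper for the example: parses properties-file text held in a string. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Returning an empty OptionalInt when the key is missing mirrors the "by default, will not be present" behaviour, so callers can skip the retry logic entirely unless an operator opts in.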
Diffs (updated)
---

  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
  ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
  ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
  ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
  ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
  ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
  ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
  ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
  ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/44926/diff/

Testing
---

Verified on a live cluster.
New unit test passed, waiting for full set of unit test results.

Thanks,

Alejandro Fernandez
Re: Review Request 44926: Auto-retry on failure during RU/EU
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44926/
---

(Updated March 21, 2016, 6:57 p.m.)

Review request for Ambari, Jonathan Hurley and Nate Cole.

Changes
---

Addressed comments and retested.

Summary (updated)
---

Auto-retry on failure during RU/EU

Bugs: AMBARI-15446
    https://issues.apache.org/jira/browse/AMBARI-15446

Repository: ambari

Description
---

When a failure occurs during RU/EU and the task transitions to HOLDING_FAILED
or HOLDING_TIMEDOUT, we want Ambari to automatically retry for up to x minutes.
This is useful when a host goes down while Ambari is running a task on it.
ambari.properties will have 1 new parameter, e.g.,
stack-upgrade.max_retry_timeout_mins=15 (by default, it will not be present).
If Ambari Server is restarted, it should be able to recover.
Today, Action Scheduler increases the attempt_count whenever a task is
retried, but it requires resetting the start_time to -1. Because of this, we
cannot rely on the start_time property to know when to time out after several
retries.

For the implementation, we will add another thread to Ambari that monitors
failed tasks only during an active RU/EU and changes their status back to
PENDING so that Action Scheduler can reschedule them.
Luckily, any tasks in the HOLDING_TIMEDOUT and HOLDING_FAILED states are
blocking, so no other stages are allowed to proceed.
In order to know when a task was first started, we will add a new property to
the host_role_command table called original_start_time.

For the agents, we need to ensure that they always write out a response. On
the first heartbeat, an agent should send the status of its last command so we
know it failed and Ambari can retry.
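To illustrate why a separate original_start_time is needed, here is a minimal sketch of how the two timestamps could diverge across retries. The class TaskTimingSketch and its methods are hypothetical and not part of HostRoleCommand; incrementing the attempt counter on every start is an assumption about how attempt_count behaves.

```java
/** Hypothetical sketch of start-time bookkeeping across retries; not Ambari's HostRoleCommand. */
final class TaskTimingSketch {

    /** Sentinel for "not set", matching the -1 mentioned in the description. */
    static final long UNSET = -1L;

    private long startTime = UNSET;
    private long originalStartTime = UNSET;
    private int attemptCount = 0;

    /** Called when an attempt begins: start_time is overwritten, original_start_time is set once. */
    void markStarted(long nowMs) {
        startTime = nowMs;
        if (originalStartTime == UNSET) {
            originalStartTime = nowMs; // only the first attempt records this value
        }
        attemptCount++; // assumption: every attempt bumps the counter
    }

    /** Called before rescheduling: start_time is reset, original_start_time is left untouched. */
    void resetForRetry() {
        startTime = UNSET;
    }

    long getStartTime() {
        return startTime;
    }

    long getOriginalStartTime() {
        return originalStartTime;
    }

    int getAttemptCount() {
        return attemptCount;
    }
}
```

After a first attempt at time 1000, a reset, and a second attempt at time 5000, the start time would read 5000 while the original start time still reads 1000, which is the value a retry timeout has to be measured against.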
Diffs (updated)
---

  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/ActionDBAccessorImpl.java 429f573
  ambari-server/src/main/java/org/apache/ambari/server/actionmanager/HostRoleCommand.java 2764b3f
  ambari-server/src/main/java/org/apache/ambari/server/agent/HeartbeatProcessor.java a1a686a
  ambari-server/src/main/java/org/apache/ambari/server/configuration/Configuration.java 9404506
  ambari-server/src/main/java/org/apache/ambari/server/orm/dao/HostRoleCommandDAO.java f5b1cb4
  ambari-server/src/main/java/org/apache/ambari/server/orm/entities/HostRoleCommandEntity.java 19f0602
  ambari-server/src/main/java/org/apache/ambari/server/state/services/RetryUpgradeActionService.java PRE-CREATION
  ambari-server/src/main/java/org/apache/ambari/server/topology/HostRequest.java 9eb514a
  ambari-server/src/main/java/org/apache/ambari/server/topology/LogicalRequest.java 82edbcf
  ambari-server/src/main/java/org/apache/ambari/server/upgrade/UpgradeCatalog240.java 7b83710
  ambari-server/src/main/resources/Ambari-DDL-MySQL-CREATE.sql a07c6fc
  ambari-server/src/main/resources/Ambari-DDL-Oracle-CREATE.sql b2b450a
  ambari-server/src/main/resources/Ambari-DDL-Postgres-CREATE.sql cec122e
  ambari-server/src/main/resources/Ambari-DDL-Postgres-EMBEDDED-CREATE.sql 96fc720
  ambari-server/src/main/resources/Ambari-DDL-SQLAnywhere-CREATE.sql c425d6f
  ambari-server/src/main/resources/Ambari-DDL-SQLServer-CREATE.sql 2a89e26
  ambari-server/src/test/java/org/apache/ambari/server/state/services/RetryUpgradeActionServiceTest.java PRE-CREATION

Diff: https://reviews.apache.org/r/44926/diff/

Testing (updated)
---

Verified on a live cluster.
New unit test passed, waiting for full set of unit test results.

Thanks,

Alejandro Fernandez