[jira] [Commented] (IMPALA-8634) Catalog client should be resilient to temporary Catalog outage
[ https://issues.apache.org/jira/browse/IMPALA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933844#comment-16933844 ]

ASF subversion and git services commented on IMPALA-8634:
---------------------------------------------------------

Commit b96b3b0b1ca97e5d756392a159e22dfcd8bcae71 in impala's branch refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b96b3b0 ]

IMPALA-8634: Catalog client should retry RPCs

Add retries to catalogd RPCs. Previously, connection failures triggered a
retry, but failures on the actual RPC did not. This change replaces all
usages of ClientCache::DoRpc() in the CatalogOpExecutor with
ClientCache::DoRpcWithRetry(). It also moves the connection retry loop into
DoRpcWithRetry(), instead of relying on the ClientCache to retry the
connection.

This patch is based on IMPALA-8904, which adds similar functionality to
statestore RPCs.

Testing:
* Renamed test_statestore_rpc_errors.py to test_services_rpc_errors.py and
  added new tests for catalogd RPC errors
* Added new tests to test_restart_services.py
* Ran core tests

Change-Id: I7f33ad2b36d301fb64e70a939e71decab0ca993c
Reviewed-on: http://gerrit.cloudera.org:8080/14246
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins

> Catalog client should be resilient to temporary Catalog outage
> --------------------------------------------------------------
>
>                 Key: IMPALA-8634
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8634
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 3.2.0
>            Reporter: Michael Ho
>            Assignee: Sahil Takiar
>            Priority: Critical
>
> Currently, when the catalog server is down, catalog clients will fail all
> RPCs sent to it. In essence, DDL queries will fail and the Impala service
> becomes a lot less functional. Catalog clients should consider retrying
> failed RPCs with some exponential backoff in between while the catalog
> server is being restarted after crashing. We probably need to add
> [a test|https://github.com/apache/impala/blob/master/tests/custom_cluster/test_restart_services.py]
> to exercise the paths of catalog restart to verify coordinators are
> resilient to it.
> cc'ing [~stakiar], [~joemcdonnell], [~twm378]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
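
To make the idea in the message above concrete, here is a minimal C++ sketch of retrying an RPC with exponential backoff, as the issue description suggests. It is not Impala's ClientCache::DoRpcWithRetry(); the function name RetryWithBackoff, its parameters, and its return convention are all hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper for illustration; not part of Impala.
// Calls `rpc` until it returns true or `max_attempts` calls have been made.
// Between failed attempts it sleeps for a delay that doubles each time,
// capped at `max_backoff`.
// Returns the 1-based attempt number that succeeded, or 0 if all failed.
inline int RetryWithBackoff(const std::function<bool()>& rpc, int max_attempts,
                            std::chrono::milliseconds initial_backoff,
                            std::chrono::milliseconds max_backoff) {
  assert(max_attempts > 0);
  std::chrono::milliseconds backoff = initial_backoff;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (rpc()) return attempt;
    if (attempt == max_attempts) break;  // no sleep after the final failure
    std::this_thread::sleep_for(backoff);
    backoff = std::min(backoff * 2, max_backoff);
  }
  return 0;
}
```

A caller would wrap the actual RPC in a lambda that returns true on success. Capping the delay keeps a long outage from pushing the wait between attempts arbitrarily high.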
[jira] [Commented] (IMPALA-8634) Catalog client should be resilient to temporary Catalog outage
[ https://issues.apache.org/jira/browse/IMPALA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929371#comment-16929371 ]

Sahil Takiar commented on IMPALA-8634:
--------------------------------------

Actually, I think that if we just do the exact same thing as IMPALA-8904, it
should all work.
[jira] [Commented] (IMPALA-8634) Catalog client should be resilient to temporary Catalog outage
[ https://issues.apache.org/jira/browse/IMPALA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929353#comment-16929353 ]

Sahil Takiar commented on IMPALA-8634:
--------------------------------------

The existing code actually already does this. The flags
{{catalog_client_connection_num_retries}} and
{{catalog_client_rpc_retry_interval_ms}} control how the client retries
connecting to the catalog (the retry count and the interval between retries).
The issue is that connection establishment is retried, but individual RPCs
are not (unless the RPC hits a connection reset). So a fix would be to use
{{DoRpcWithRetry}} instead of {{DoRpc}} (similar to what was done in
IMPALA-8904).

There is some odd behavior with the retry logic, though. If there is a cached
client connection, the catalogd crashes, and then a query runs, the impalad
will retry the connection {{2 * catalog_client_connection_num_retries}} times,
because both the RPC and the connection establishment are retried.

One way to fix this would be to remove the connection establishment retry and
let the RPC retry handle all retries. The problem is that, the way the code is
written, any attempt to establish a new connection would then not be retried
(an RPC that uses a cached connection would still be retried).

Ideally, the following scenarios are all handled correctly (i.e. each is
retried exactly {{catalog_client_connection_num_retries}} times):
* New connection establishment
* Cached connection resets
* RPC failures

It would be nice to rename {{catalog_client_connection_num_retries}} to
{{catalog_client_rpc_num_retries}} as well.
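
The three scenarios listed in the comment above can be sketched as a single retry loop with one shared budget, so that connection attempts and RPC attempts do not multiply each other. This is only an illustration under stated assumptions: FakeCatalogClient and DoRpcWithSingleBudget are hypothetical names, not Impala classes or functions, and num_retries merely plays the role of {{catalog_client_connection_num_retries}}. The sleep between attempts is omitted to keep the sketch short.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a catalog client; not an Impala class.
struct FakeCatalogClient {
  bool connected = false;           // true when a (possibly cached) connection exists
  int connect_attempts = 0;         // number of Connect() calls
  int rpc_attempts = 0;             // number of SendRpc() calls
  std::function<bool()> server_up;  // reports whether the fake server is reachable

  // Tries once to establish a connection.
  bool Connect() {
    ++connect_attempts;
    connected = server_up();
    return connected;
  }

  // Sends one RPC. If the server is unreachable, the connection is dropped
  // (modelling a reset of a cached connection) and the RPC fails.
  bool SendRpc() {
    ++rpc_attempts;
    if (!server_up()) connected = false;
    return connected;
  }
};

// One loop, one budget: every iteration makes at most one connection attempt
// and at most one RPC attempt, and a failure of either ends the iteration.
inline bool DoRpcWithSingleBudget(FakeCatalogClient& client, int num_retries) {
  assert(num_retries > 0);
  for (int attempt = 0; attempt < num_retries; ++attempt) {
    if (!client.connected && !client.Connect()) continue;  // connect failed
    if (client.SendRpc()) return true;                     // RPC succeeded
  }
  return false;
}
```

With this shape, a client that starts with a cached connection to a crashed server gives up after num_retries iterations in total, rather than after twice that many connection attempts.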