[jira] [Commented] (IMPALA-8634) Catalog client should be resilient to temporary Catalog outage

2019-09-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933844#comment-16933844
 ] 

ASF subversion and git services commented on IMPALA-8634:
-

Commit b96b3b0b1ca97e5d756392a159e22dfcd8bcae71 in impala's branch 
refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b96b3b0 ]

IMPALA-8634: Catalog client should retry RPCs

Add retries to catalogd RPCs. Previously, connection failures triggered
a retry, but failures on the actual RPC did not trigger a retry. This
change replaces all usages of ClientCache::DoRpc() in the
CatalogOpExecutor with ClientCache::DoRpcWithRetry(). This change moves
the connection retry loop to DoRpcWithRetry(), instead of relying on the
ClientCache to retry the connection.

This patch is based on IMPALA-8904, which adds similar functionality to
statestore RPCs.

Testing:
* Renamed test_statestore_rpc_errors.py to test_services_rpc_errors.py
and added new tests for catalogd RPC errors
* Added new tests to test_restart_services.py
* Ran core tests

Change-Id: I7f33ad2b36d301fb64e70a939e71decab0ca993c
Reviewed-on: http://gerrit.cloudera.org:8080/14246
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Catalog client should be resilient to temporary Catalog outage
> --------------------------------------------------------------
>
> Key: IMPALA-8634
> URL: https://issues.apache.org/jira/browse/IMPALA-8634
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Affects Versions: Impala 3.2.0
> Reporter: Michael Ho
> Assignee: Sahil Takiar
> Priority: Critical
>
> Currently, when the catalog server is down, catalog clients will fail all 
> RPCs sent to it. In essence, DDL queries will fail and the Impala service 
> becomes a lot less functional. Catalog clients should consider retrying 
> failed RPCs with some exponential backoff in between while catalog server is 
> being restarted after crashing. We probably need to add [a test 
> |https://github.com/apache/impala/blob/master/tests/custom_cluster/test_restart_services.py]
>  to exercise the paths of catalog restart to verify coordinators are 
> resilient to it.
> cc'ing [~stakiar], [~joemcdonnell], [~twm378]
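
The "exponential backoff" suggested in the description can be pictured as a schedule of waits between retries. The following is only a minimal sketch: the function name, the 100 ms base, the factor of 2 and the 10 s cap are all illustrative assumptions, not Impala flags or defaults.

```python
def backoff_delays_ms(num_retries, base_ms=100, factor=2.0, cap_ms=10_000):
    """Return the wait (in ms) before each retry, growing geometrically.

    All parameter names and defaults are hypothetical; they are not
    taken from Impala.
    """
    delays = []
    delay = float(base_ms)
    for _ in range(num_retries):
        # Cap each wait so that a long outage does not produce huge sleeps.
        delays.append(int(min(delay, cap_ms)))
        delay *= factor
    return delays
```

A caller would sleep for the i-th entry after the i-th failed attempt. The cap bounds the worst-case wait while the catalog server is restarting.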



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-8634) Catalog client should be resilient to temporary Catalog outage

2019-09-13 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929371#comment-16929371
 ] 

Sahil Takiar commented on IMPALA-8634:
--

Actually, I think that if we just do the exact same thing as IMPALA-8904, it 
should all work.




[jira] [Commented] (IMPALA-8634) Catalog client should be resilient to temporary Catalog outage

2019-09-13 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929353#comment-16929353
 ] 

Sahil Takiar commented on IMPALA-8634:
--

The existing code actually already does this. The flags 
{{catalog_client_connection_num_retries}} and 
{{catalog_client_rpc_retry_interval_ms}} control the number of times the client 
tries to re-connect to the catalog.

The issue is that connection establishment is retried, but individual RPCs are 
not retried (unless the RPC hits a connection reset). So a fix would be to use 
{{DoRpcWithRetry}} instead of {{DoRpc}} (similar to what was done in 
IMPALA-8904).
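
To make the difference concrete, here is a small Python model of the two call styles. It is a sketch only, not the C++ code in Impala: {{do_rpc}}, {{do_rpc_with_retry}}, {{RpcError}} and {{make_flaky_rpc}} are hypothetical names, and treating the retry count as the total number of attempts is an assumption.

```python
import time


class RpcError(Exception):
    """Stand-in for a failed RPC (hypothetical; not an Impala type)."""


def do_rpc(rpc):
    """Single attempt: a failure on the RPC itself goes straight to the caller."""
    return rpc()


def do_rpc_with_retry(rpc, num_retries, retry_interval_ms, sleep=time.sleep):
    """Call rpc() up to num_retries times, pausing between failed attempts."""
    if num_retries < 1:
        raise ValueError("num_retries must be at least 1")
    for attempt in range(1, num_retries + 1):
        try:
            return rpc()
        except RpcError:
            if attempt == num_retries:
                raise  # budget exhausted: surface the last error
            sleep(retry_interval_ms / 1000.0)


def make_flaky_rpc(failures_before_success):
    """Build an RPC that fails the given number of times, then returns "ok"."""
    state = {"calls": 0}

    def rpc():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RpcError("catalogd unavailable")
        return "ok"

    return rpc
```

With {{make_flaky_rpc(2)}}, {{do_rpc}} raises on the first call, whereas {{do_rpc_with_retry}} with three attempts returns on the third.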

There is some odd behavior with the retry logic though. If there is a cached 
client connection, the catalogd crashes, and then a query runs, the impalad 
will retry the connection {{2 * catalog_client_connection_num_retries}} times 
because both the RPC and connection establishment are retried. One way to fix 
this would be to remove the connection establishment retry and let the RPC 
retry handle all retries. The issue is that, the way the code is written, any 
attempt to establish a new connection would then not be retried (an attempt 
that uses a cached connection would be).

Ideally, the following scenarios are handled correctly (e.g. each is retried 
exactly {{catalog_client_connection_num_retries}} times):
* New connection establishment
* Cached connection resets
* RPC failures
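
One way to read the three scenarios above is as a single retry loop with one shared attempt budget, so that no failure type is retried more often than the others. The sketch below models that reading in Python. All names are hypothetical, the sleep between attempts is omitted for brevity, and this is not how Impala's ClientCache is structured.

```python
class ConnectError(Exception):
    """Hypothetical: a new connection could not be established."""


class ConnectionReset(Exception):
    """Hypothetical: a cached connection turned out to be dead."""


class RpcError(Exception):
    """Hypothetical: the RPC failed on a live connection."""


def call_with_unified_retry(connect, rpc, cached_conn=None, num_retries=3):
    """Run rpc(conn) with one attempt budget shared by all failure types."""
    if num_retries < 1:
        raise ValueError("num_retries must be at least 1")
    conn = cached_conn
    last_error = None
    for _ in range(num_retries):
        try:
            if conn is None:
                conn = connect()  # new connection establishment
            return rpc(conn)
        except (ConnectError, ConnectionReset) as e:
            conn = None  # drop the connection so the next attempt reconnects
            last_error = e
        except RpcError as e:
            last_error = e  # connection is still usable; retry the RPC on it
    raise last_error
```

Because every failed attempt consumes one unit of the same budget, a stale cached connection followed by failed reconnects costs at most {{num_retries}} attempts in total rather than twice that.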

Would be nice to rename {{catalog_client_connection_num_retries}} to 
{{catalog_client_rpc_num_retries}} as well.
