[ 
https://issues.apache.org/jira/browse/IMPALA-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832710#comment-16832710
 ] 

Michael Ho edited comment on IMPALA-8339 at 5/3/19 6:18 PM:
------------------------------------------------------------

Thanks [~stakiar]. In general, Impala should be resilient when one or more 
executors fail during execution: it should issue transparent retries while not 
scheduling fragments on known-bad hosts. This JIRA attacks a narrower subset 
of that problem by addressing only query startup failures.

We need to be careful to use transparent retries only for transient, 
recoverable failures. For instance, we shouldn't retry if the retry will hit 
the same failure (e.g. memory limit exceeded). There may also be a change in 
behavior in how Impala exposes results to clients. In particular, we may not 
be able to support both streaming result sets and transparent retries for all 
queries: some queries are non-deterministic, so it may not be trivial to 
expose a subset of the results to the client and then replay to the point of 
failure while reproducing that same subset.
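To make the retry-eligibility point concrete, here is a minimal sketch of how a coordinator might classify failures. The enum values and function names are hypothetical illustrations, not actual Impala error codes or APIs: only transient failures (e.g. an RPC to a crashed executor) are retried, deterministic failures are not, and retry is disabled once any rows have been streamed to the client.

```cpp
// Hypothetical failure categories; illustrative only, not Impala's real codes.
enum class FailureKind {
  RPC_TIMEOUT,          // transient: executor may be slow or restarting
  EXECUTOR_UNREACHABLE, // transient: retry on other, healthy hosts
  MEM_LIMIT_EXCEEDED,   // deterministic: a retry would hit the same limit
  ANALYSIS_ERROR        // deterministic: the query itself is at fault
};

bool IsTransparentlyRetryable(FailureKind kind, bool results_already_streamed) {
  // Once rows have reached the client, replaying a non-deterministic query
  // could produce a different result set, so transparent retry is unsafe
  // regardless of why the query failed.
  if (results_already_streamed) return false;
  switch (kind) {
    case FailureKind::RPC_TIMEOUT:
    case FailureKind::EXECUTOR_UNREACHABLE:
      return true;   // transient and recoverable
    default:
      return false;  // retrying would reproduce the same failure
  }
}
```

The interesting design choice is the `results_already_streamed` guard: it is what ties the streaming-result-set limitation to retry eligibility.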


> Coordinator should be more resilient to fragment instances startup failure
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-8339
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8339
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Distributed Exec
>            Reporter: Michael Ho
>            Assignee: Thomas Tauber-Marshall
>            Priority: Major
>              Labels: Availability, resilience
>
> Impala currently relies on the statestore for cluster membership. When an 
> Impala executor goes offline, it may take a while for the statestore to 
> declare that node unavailable and for that information to propagate to all 
> coordinator nodes. Within this window, some coordinators may still attempt 
> to issue RPCs to the faulty node, and the resulting RPC failures cause 
> query failures. In other words, many queries may fail to start within this 
> window, until all coordinator nodes receive the latest cluster membership 
> information.
> Going forward, the coordinator may need to fall back to backup executors 
> for each fragment in case some of the executors are unavailable. Moreover, 
> *the coordinator should treat cluster membership information from the 
> statestore (or any external source of truth, e.g. etcd) as a hint rather 
> than ground truth* and adjust the scheduling of fragment instances based on 
> executor availability as observed by the coordinator itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
