[jira] [Resolved] (IMPALA-9246) Make crcutils building work on aarch64

2020-01-16 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua resolved IMPALA-9246.
--
Resolution: Fixed

Fixed in:

[https://gerrit.cloudera.org/#/c/14901/]

> Make crcutils building work on aarch64
> --
>
> Key: IMPALA-9246
> URL: https://issues.apache.org/jira/browse/IMPALA-9246
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: huangtianhua
>Priority: Major
>
> Building crcutil failed on the aarch64 platform:
> g++: error: unrecognized command line option '-msse2'
>  g++: error: unrecognized command line option '-mcrc32'
>  g++: error: unrecognized command line option '-msse2'
>  g++: error: unrecognized command line option '-mcrc32'
>  Makefile:856: recipe for target 
> 'code/libcrcutil_la-multiword_64_64_cl_i386_mmx.lo' failed
>  make: *** [code/libcrcutil_la-multiword_64_64_cl_i386_mmx.lo] Error 1
>  make: *** Waiting for unfinished jobs
>  Makefile:849: recipe for target 
> 'code/libcrcutil_la-multiword_128_64_gcc_amd64_sse2.lo' failed
>  make: *** [code/libcrcutil_la-multiword_128_64_gcc_amd64_sse2.lo] Error 1
>  g++: error: unrecognized command line option '-msse2'
>  g++: error: unrecognized command line option '-mcrc32'
>  Makefile:842: recipe for target 'code/libcrcutil_la-crc32c_sse4.lo' failed
>  make: *** [code/libcrcutil_la-crc32c_sse4.lo] Error 1






[jira] [Resolved] (IMPALA-9278) error: expected primary-expression before ‘return’

2020-01-16 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua resolved IMPALA-9278.
--
Resolution: Fixed

> error: expected primary-expression before ‘return’
> --
>
> Key: IMPALA-9278
> URL: https://issues.apache.org/jira/browse/IMPALA-9278
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: huangtianhua
>Priority: Major
>
> An error is raised when executing ./buildall.sh on aarch64:
> /home/jenkins/workspace/impala/be/src/gutil/atomicops-internals-x86.h:413:15: 
> error: expected primary-expression before ‘return’ new_val = return 
> impala::ArithmeticUtil::AsUnsigned(old_val, increment);
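
For clarity, the bug is the stray {{return}} on the right-hand side of the 
assignment. A minimal sketch of the presumably intended line (an assumption 
inferred from the error message, not the actual patch):
{code:cpp}
// Reported (invalid C++): an assignment's right-hand side cannot begin
// with `return`:
//   new_val = return impala::ArithmeticUtil::AsUnsigned(old_val, increment);
// Presumably intended (assumption): drop the stray `return`.
new_val = impala::ArithmeticUtil::AsUnsigned(old_val, increment);
{code}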






[jira] [Updated] (IMPALA-9278) error: expected primary-expression before ‘return’

2020-01-16 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua updated IMPALA-9278:
-
Parent: IMPALA-9236
Issue Type: Sub-task  (was: Bug)

> error: expected primary-expression before ‘return’
> --
>
> Key: IMPALA-9278
> URL: https://issues.apache.org/jira/browse/IMPALA-9278
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: huangtianhua
>Assignee: huangtianhua
>Priority: Major
>
> An error is raised when executing ./buildall.sh on aarch64:
> /home/jenkins/workspace/impala/be/src/gutil/atomicops-internals-x86.h:413:15: 
> error: expected primary-expression before ‘return’ new_val = return 
> impala::ArithmeticUtil::AsUnsigned(old_val, increment);






[jira] [Created] (IMPALA-9303) Add time now for aarch64

2020-01-16 Thread huangtianhua (Jira)
huangtianhua created IMPALA-9303:


 Summary: Add time now for aarch64
 Key: IMPALA-9303
 URL: https://issues.apache.org/jira/browse/IMPALA-9303
 Project: IMPALA
  Issue Type: Sub-task
Reporter: huangtianhua


The system timer on ARMv8 runs at a different frequency than the CPU clock. 
Add a definition of CycleClock::Now to support aarch64.
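
A minimal sketch of what such a definition typically looks like on aarch64 
(illustrative, not the actual patch): read the architected virtual counter, 
which ticks at the fixed frequency reported by cntfrq_el0 rather than at the 
CPU clock rate.
{code:cpp}
#include <cstdint>

// Sketch (assumption: function name is illustrative). On aarch64 the
// cycle-counter analogue of x86's rdtsc is the virtual timer counter,
// which runs at a fixed system-timer frequency, not the CPU frequency.
inline int64_t CycleClockNow() {
  int64_t virtual_timer_value;
  asm volatile("mrs %0, cntvct_el0" : "=r"(virtual_timer_value));
  return virtual_timer_value;
}
{code}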






[jira] [Commented] (IMPALA-9278) error: expected primary-expression before ‘return’

2020-01-16 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017753#comment-17017753
 ] 

huangtianhua commented on IMPALA-9278:
--

[~tarmstrong], but this issue is in atomicops-internals-x86.h; I'm not sure 
why x86 is OK :)

> error: expected primary-expression before ‘return’
> --
>
> Key: IMPALA-9278
> URL: https://issues.apache.org/jira/browse/IMPALA-9278
> Project: IMPALA
>  Issue Type: Bug
>Reporter: huangtianhua
>Assignee: huangtianhua
>Priority: Major
>
> An error is raised when executing ./buildall.sh on aarch64:
> /home/jenkins/workspace/impala/be/src/gutil/atomicops-internals-x86.h:413:15: 
> error: expected primary-expression before ‘return’ new_val = return 
> impala::ArithmeticUtil::AsUnsigned(old_val, increment);






[jira] [Assigned] (IMPALA-9302) Multithreaded scanners don't check for filter effectiveness

2020-01-16 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-9302:
-

Assignee: Tim Armstrong

> Multithreaded scanners don't check for filter effectiveness
> ---
>
> Key: IMPALA-9302
> URL: https://issues.apache.org/jira/browse/IMPALA-9302
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>  Labels: multithreading, performance
>
> This can be reproduced for TPC-H Q9. I saw this on scale factor 30 locally, 
> where the mt_dop=4 version of the query uses a lot more CPU in the scan than 
> the mt_dop=0 version.
> This turns out to be because none of the runtime filters are getting 
> disabled, not even the ineffective ones.
> {noformat}
>   Filter 2 (16.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files total: 0 (0)
>  - RowGroups processed: 0 (0)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 0 (0)
>  - Rows processed: 30.97M (30970695)
>  - Rows rejected: 0 (0)
>  - Rows total: 31.01M (31009074)
>  - Splits processed: 0 (0)
>  - Splits rejected: 0 (0)
>  - Splits total: 0 (0)
>   Filter 4 (8.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files total: 0 (0)
>  - RowGroups processed: 0 (0)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 0 (0)
>  - Rows processed: 30.97M (30970695)
>  - Rows rejected: 0 (0)
>  - Rows total: 31.01M (31009074)
>  - Splits processed: 0 (0)
>  - Splits rejected: 0 (0)
>  - Splits total: 0 (0)
>   Filter 5 (8.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files total: 0 (0)
>  - RowGroups processed: 0 (0)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 0 (0)
>  - Rows processed: 30.97M (30970695)
>  - Rows rejected: 0 (0)
>  - Rows total: 31.01M (31009074)
>  - Splits processed: 0 (0)
>  - Splits rejected: 0 (0)
>  - Splits total: 0 (0)
>   Filter 8 (1.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files total: 0 (0)
>  - RowGroups processed: 0 (0)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 0 (0)
>  - Rows processed: 31.01M (31009074)
>  - Rows rejected: 0 (0)
>  - Rows total: 31.01M (31009074)
>  - Splits processed: 0 (0)
>  - Splits rejected: 0 (0)
>  - Splits total: 0 (0)
>   Filter 10 (1.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files total: 0 (0)
>  - RowGroups processed: 0 (0)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 0 (0)
>  - Rows processed: 31.01M (31009074)
>  - Rows rejected: 29.32M (29317263)
>  - Rows total: 31.01M (31009074)
>  - Splits processed: 0 (0)
>  - Splits rejected: 0 (0)
>  - Splits total: 0 (0)
> {noformat}
> In contrast here are the filters for mt_dop=0, where not all the rows are 
> processed.
> {noformat}
>   Filter 2 (16.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files total: 0 (0)
>  - RowGroups processed: 0 (0)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 0 (0)
>  - Rows processed: 8.18M (8180257)
>  - Rows rejected: 0 (0)
>  - Rows total: 180.00M (179998372)
>  - Splits processed: 0 (0)
>  - Splits rejected: 0 (0)
>  - Splits total: 0 (0)
>   Filter 4 (8.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files total: 0 (0)
>  - RowGroups processed: 0 (0)
>  - RowGroups rejected: 0 (0)
>  - RowGroups total: 0 (0)
>  - Rows processed: 8.18M (8180257)
>  - Rows rejected: 0 (0)
>  - Rows total: 180.00M (179998372)
>  - Splits processed: 0 (0)
>  - Splits rejected: 0 (0)
>  - Splits total: 0 (0)
>   Filter 5 (8.00 MB):
>  - Files processed: 0 (0)
>  - Files rejected: 0 (0)
>  - Files 

[jira] [Created] (IMPALA-9302) Multithreaded scanners don't check for filter effectiveness

2020-01-16 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9302:
-

 Summary: Multithreaded scanners don't check for filter 
effectiveness
 Key: IMPALA-9302
 URL: https://issues.apache.org/jira/browse/IMPALA-9302
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Reporter: Tim Armstrong


This can be reproduced for TPC-H Q9. I saw this on scale factor 30 locally, 
where the mt_dop=4 version of the query uses a lot more CPU in the scan than 
the mt_dop=0 version.

This turns out to be because none of the runtime filters are getting disabled, 
not even the ineffective ones.
{noformat}
  Filter 2 (16.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 30.97M (30970695)
 - Rows rejected: 0 (0)
 - Rows total: 31.01M (31009074)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
  Filter 4 (8.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 30.97M (30970695)
 - Rows rejected: 0 (0)
 - Rows total: 31.01M (31009074)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
  Filter 5 (8.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 30.97M (30970695)
 - Rows rejected: 0 (0)
 - Rows total: 31.01M (31009074)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
  Filter 8 (1.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 31.01M (31009074)
 - Rows rejected: 0 (0)
 - Rows total: 31.01M (31009074)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
  Filter 10 (1.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 31.01M (31009074)
 - Rows rejected: 29.32M (29317263)
 - Rows total: 31.01M (31009074)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
{noformat}
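
(For illustration, a minimal sketch of the kind of effectiveness check that 
could disable a filter like Filter 2 above, which processed 30.97M rows and 
rejected none; the names and thresholds are assumptions, not Impala's actual 
values.)
{code:cpp}
#include <cstdint>

// Sketch: once enough rows have been observed, a filter that rejects too
// small a fraction of them is not worth its per-row cost and should be
// disabled.
bool ShouldDisableFilter(int64_t rows_processed, int64_t rows_rejected) {
  constexpr int64_t kMinRowsObserved = 100 * 1000;  // assumption
  constexpr double kMinRejectedFraction = 0.1;      // assumption
  if (rows_processed < kMinRowsObserved) return false;
  return static_cast<double>(rows_rejected) / rows_processed <
         kMinRejectedFraction;
}
{code}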

In contrast here are the filters for mt_dop=0, where not all the rows are 
processed.
{noformat}
  Filter 2 (16.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 8.18M (8180257)
 - Rows rejected: 0 (0)
 - Rows total: 180.00M (179998372)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
  Filter 4 (8.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 8.18M (8180257)
 - Rows rejected: 0 (0)
 - Rows total: 180.00M (179998372)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
  Filter 5 (8.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups processed: 0 (0)
 - RowGroups rejected: 0 (0)
 - RowGroups total: 0 (0)
 - Rows processed: 8.18M (8180257)
 - Rows rejected: 0 (0)
 - Rows total: 180.00M (179998372)
 - Splits processed: 0 (0)
 - Splits rejected: 0 (0)
 - Splits total: 0 (0)
  Filter 8 (1.00 MB):
 - Files processed: 0 (0)
 - Files rejected: 0 (0)
 - Files total: 0 (0)
 - RowGroups 

[jira] [Created] (IMPALA-9301) Aux error info should detect multiple RPC failures

2020-01-16 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9301:


 Summary: Aux error info should detect multiple RPC failures
 Key: IMPALA-9301
 URL: https://issues.apache.org/jira/browse/IMPALA-9301
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Sahil Takiar


Suggested during the review of 
[IMPALA-9296|http://issues.cloudera.org/browse/IMPALA-9296] 
([https://gerrit.cloudera.org/#/c/15046/]):

{quote}

I'm not sure that this is the right way to do it, since it means that if a 
backend sees multiple RPC failures in a single query, only one will ever be 
reported to the coordinator.

Of course, I've been advocating for being aggressive about blacklisting. 
Suppose there were two RPC failures; then there are two cases. Either both 
RPCs were to the same executor, in which case the two failures make us more 
confident that something is going on with that executor, and we might 
actually want to blacklist it twice (which will just extend the amount of 
time that it stays blacklisted); or the two RPCs were to different 
executors, in which case, if we only blacklist one of them and then retry 
the query, it may very well fail again.

And even if we do want to stay more conservative about blacklisting, you've 
suggested before (and I agree) that it's generally preferable to report as much 
info about errors as we've got, and then centralize the logic for deciding how 
to act on those errors in the coordinator.

{quote}
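
A small sketch of the re-blacklisting behavior described in the quote above 
(hypothetical structure, not Impala's actual implementation): each additional 
failure report against the same executor extends how long it stays off the 
schedule.
{code:cpp}
#include <algorithm>
#include <cstdint>

// Sketch (assumption): one entry per executor; repeated failure reports
// push the expiry further out instead of being ignored.
struct BlacklistEntry {
  int64_t expiry_ms = 0;

  void ReportFailure(int64_t now_ms, int64_t base_timeout_ms) {
    // Extend from the later of "now" and the current expiry, so two
    // reports keep the executor blacklisted roughly twice as long as one.
    expiry_ms = std::max(now_ms, expiry_ms) + base_timeout_ms;
  }

  bool IsBlacklisted(int64_t now_ms) const { return now_ms < expiry_ms; }
};
{code}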






[jira] [Created] (IMPALA-9300) Add a limit on the number of nodes that can be blacklisted per query

2020-01-16 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9300:


 Summary: Add a limit on the number of nodes that can be 
blacklisted per query
 Key: IMPALA-9300
 URL: https://issues.apache.org/jira/browse/IMPALA-9300
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Sahil Takiar


We currently have no limit on the number of nodes that can be blacklisted if an 
Exec() RPC fails.

For data transfer (TransmitData()) RPC failures, we blacklist at most one node 
per status update (so typically one node per query).

It would be nice to have a global limit on the number of nodes blacklisted to 
prevent a single query from blacklisting a large part of the cluster. This can 
help guard against intermittent, cluster-wide, hardware issues that might only 
last a few seconds. It would be nice if the max number of blacklist-able nodes 
is a function of the cluster size (e.g. a query cannot blacklist more than a 
third of the nodes in the cluster).

TBD whether the value should be configurable.
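
A one-line sketch of the suggested cluster-size-dependent cap (the one-third 
fraction comes from the example above; the function name and the floor of one 
are assumptions):
{code:cpp}
#include <algorithm>

// Sketch: cap the number of nodes a single query may blacklist at a
// fraction of the cluster, while always allowing at least one.
int MaxBlacklistableNodes(int cluster_size) {
  return std::max(1, cluster_size / 3);  // assumption: one third, floor of 1
}
{code}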






[jira] [Updated] (IMPALA-9253) Blacklist additional posix error codes for failed DataStreamService RPCs

2020-01-16 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated IMPALA-9253:
-
Parent: IMPALA-9299
Issue Type: Sub-task  (was: Improvement)

> Blacklist additional posix error codes for failed DataStreamService RPCs
> 
>
> Key: IMPALA-9253
> URL: https://issues.apache.org/jira/browse/IMPALA-9253
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Priority: Major
>
> Filing as a follow-up to 
> [IMPALA-9137|http://issues.cloudera.org/browse/IMPALA-9137], which blacklists 
> a node if an RPC fails with specific posix error codes:
>  * 107 = ENOTCONN: Transport endpoint is not connected
>  * 108 = ESHUTDOWN: Cannot send after transport endpoint shutdown
>  * 111 = ECONNREFUSED: Connection refused
> These codes were produced by running a query, killing a node running that 
> query, and then seeing what error codes the query failed with.
> There may be other error codes that are worth using for node blacklisting as 
> well. One way to come up with more error codes is to use iptables to 
> introduce network faults between Impala processes and see how RPCs fail.
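
A minimal sketch of the code-matching step (the helper name is hypothetical; 
the three codes are the ones listed above):
{code:cpp}
#include <cerrno>

// Sketch (assumption): classify a failed RPC's posix error code as one
// that justifies blacklisting the target node.
bool IsBlacklistablePosixError(int posix_code) {
  switch (posix_code) {
    case ENOTCONN:      // 107: transport endpoint is not connected
    case ESHUTDOWN:     // 108: cannot send after transport endpoint shutdown
    case ECONNREFUSED:  // 111: connection refused
      return true;
    default:
      return false;
  }
}
{code}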






[jira] [Updated] (IMPALA-9137) Blacklist node if a DataStreamService RPC to the node fails

2020-01-16 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated IMPALA-9137:
-
Parent: IMPALA-9299
Issue Type: Sub-task  (was: Bug)

> Blacklist node if a DataStreamService RPC to the node fails
> ---
>
> Key: IMPALA-9137
> URL: https://issues.apache.org/jira/browse/IMPALA-9137
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: Impala 3.4.0
>
>
> If a query fails because an RPC to a specific node failed, the query error 
> message will be similar to one of the following:
> * {{ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv 
> got EOF from 10.65.30.141:27000 (error 108)}}
> * {{ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv 
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> * {{ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client 
> connection negotiation failed: client connection to 10.65.26.254:27000: 
> connect: Connection refused (error 111)}}
> * {{ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv 
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> RPCs are already retried, so it is likely that something is wrong with the 
> target node. Perhaps it crashed or is so overloaded that it can't process RPC 
> requests. In any case, the Impala Coordinator should blacklist the target of 
> the failed RPC so that future queries don't fail with the same error.
> If the node crashed, the statestore will eventually remove the failed node 
> from the cluster as well. However, the statestore can take a while to detect 
> a failed node because it has a long timeout. The issue is that queries can 
> still fail within the timeout window. 
> This is necessary for transparent query retries because if a node does crash, 
> it will take too long for the statestore to remove the crashed node from the 
> cluster. So any attempt at retrying a query will just fail.






[jira] [Updated] (IMPALA-9224) Blacklist nodes with faulty disks

2020-01-16 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated IMPALA-9224:
-
Parent: IMPALA-9299
Issue Type: Sub-task  (was: Improvement)

> Blacklist nodes with faulty disks
> -
>
> Key: IMPALA-9224
> URL: https://issues.apache.org/jira/browse/IMPALA-9224
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend
>Reporter: Sahil Takiar
>Priority: Critical
>
> Similar to IMPALA-8339 and IMPALA-9137, Impala should blacklist nodes with 
> faulty disks. Specifically, if a query fails because of a disk error, the 
> node with that disk should be blacklisted and the query should be retried.
> We shouldn't need to blacklist nodes that fail to read from HDFS / S3, since 
> those systems have their own internal mechanisms for recovering from faulty 
> disks. 
> We should only blacklist nodes when failing to read / write from *local* 
> disks.
> The two main components of Impala that read / write from local disk are the 
> spill-to-disk and data caching features. Whenever a query fails because of a 
> disk failure during spill-to-disk, the node should be blacklisted.
> Reads / writes from / to the data cache are a bit different. If a cache read 
> fails due to a disk error, the error will be printed out and the Lookup() 
> call to the cache will return 0 bytes read, which means it couldn't find the 
> data in the cache. This should cause the scan to fall back to a normal, 
> un-cached read. While this doesn't affect query correctness or the ability 
> for a query to complete, it can affect performance. Since cache failures 
> don't result in query failures, we might consider having a threshold of data 
> cache read / writes errors before blacklisting a node.
> We need to be careful to only capture specific disk failures - e.g. disk 
> quota, permission denied, etc. errors shouldn't result in blacklisting as 
> they typically are a result of system misconfiguration.
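
A small sketch of the threshold idea for data cache errors (names and the 
threshold value are assumptions): spill-to-disk failures would blacklist 
immediately, while cache I/O errors only count toward blacklisting once they 
become frequent.
{code:cpp}
#include <atomic>
#include <cstdint>

// Sketch (assumption): cache read/write errors degrade performance but not
// correctness, so only signal blacklisting after repeated failures.
class DataCacheErrorTracker {
 public:
  // Returns true once the error count crosses the blacklist threshold.
  bool RecordError() {
    constexpr int64_t kBlacklistThreshold = 100;  // assumption
    return errors_.fetch_add(1) + 1 >= kBlacklistThreshold;
  }

 private:
  std::atomic<int64_t> errors_{0};
};
{code}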






[jira] [Updated] (IMPALA-9243) Coordinator Web UI should list which executors have been blacklisted

2020-01-16 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated IMPALA-9243:
-
Parent: IMPALA-9299
Issue Type: Sub-task  (was: Improvement)

> Coordinator Web UI should list which executors have been blacklisted
> 
>
> Key: IMPALA-9243
> URL: https://issues.apache.org/jira/browse/IMPALA-9243
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Sahil Takiar
>Priority: Major
>
> Currently, information about which nodes are blacklisted only shows up in 
> runtime profiles and Coordinator logs. It would be nice to display 
> blacklisting information in the Web UI as well so that a user can view which 
> nodes are blacklisted at any given time.
> One potential place to put the blacklisting information is the /backends 
> page, which already lists all the backends that are part of the cluster. A 
> new column called "Status", with values of either "Active" or 
> "Blacklisted", would be nice (perhaps we should refactor the "Quiescing" 
> column into the new "Status" column as well). This is similar to what the 
> Spark Web UI does for blacklisted nodes: 
> [https://ndu0e1pobsf1dobtvj5nls3q-wpengine.netdna-ssl.com/wp-content/uploads/2019/08/BLACKLIST-SCHEDULING.png]






[jira] [Updated] (IMPALA-9137) Blacklist node if a DataStreamService RPC to the node fails

2020-01-16 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated IMPALA-9137:
-
Parent: (was: IMPALA-9124)
Issue Type: Bug  (was: Sub-task)

> Blacklist node if a DataStreamService RPC to the node fails
> ---
>
> Key: IMPALA-9137
> URL: https://issues.apache.org/jira/browse/IMPALA-9137
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: Impala 3.4.0
>
>
> If a query fails because an RPC to a specific node failed, the query error 
> message will be similar to one of the following:
> * {{ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv 
> got EOF from 10.65.30.141:27000 (error 108)}}
> * {{ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv 
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> * {{ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client 
> connection negotiation failed: client connection to 10.65.26.254:27000: 
> connect: Connection refused (error 111)}}
> * {{ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv 
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> RPCs are already retried, so it is likely that something is wrong with the 
> target node. Perhaps it crashed or is so overloaded that it can't process RPC 
> requests. In any case, the Impala Coordinator should blacklist the target of 
> the failed RPC so that future queries don't fail with the same error.
> If the node crashed, the statestore will eventually remove the failed node 
> from the cluster as well. However, the statestore can take a while to detect 
> a failed node because it has a long timeout. The issue is that queries can 
> still fail within the timeout window. 
> This is necessary for transparent query retries because if a node does crash, 
> it will take too long for the statestore to remove the crashed node from the 
> cluster. So any attempt at retrying a query will just fail.






[jira] [Updated] (IMPALA-8339) Coordinator should be more resilient to fragment instances startup failure

2020-01-16 Thread Sahil Takiar (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated IMPALA-8339:
-
Parent: IMPALA-9299
Issue Type: Sub-task  (was: Improvement)

> Coordinator should be more resilient to fragment instances startup failure
> --
>
> Key: IMPALA-8339
> URL: https://issues.apache.org/jira/browse/IMPALA-8339
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Distributed Exec
>Reporter: Michael Ho
>Assignee: Thomas Tauber-Marshall
>Priority: Critical
>  Labels: Availability, resilience
> Fix For: Impala 3.3.0
>
>
> Impala currently relies on statestore for cluster membership. When an Impala 
> executor goes offline, it may take a while for statestore to declare that 
> node as unavailable and for that information to be propagated to all 
> coordinator nodes. Within this window, some coordinator nodes may still 
> attempt to issue RPCs to the faulty node, resulting in RPC failures and, in 
> turn, query failures. In other words, many queries may fail to start 
> within this window until all coordinator nodes get the latest information on 
> cluster membership.
> Going forward, the coordinator may need to fall back to using backup executors 
> for each fragment in case some of the executors are not available. Moreover, 
> *coordinator should treat the cluster membership information from statestore 
> (or any external source of truth e.g. etcd) as hints instead of ground truth* 
> and adjust the scheduling of fragment instances based on the availability of 
> the executors from the coordinator's perspective.






[jira] [Created] (IMPALA-9299) Node Blacklisting: Coordinators should blacklist unhealthy nodes

2020-01-16 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9299:


 Summary: Node Blacklisting: Coordinators should blacklist 
unhealthy nodes
 Key: IMPALA-9299
 URL: https://issues.apache.org/jira/browse/IMPALA-9299
 Project: IMPALA
  Issue Type: New Feature
  Components: Backend
Reporter: Sahil Takiar
Assignee: Thomas Tauber-Marshall


Top level JIRA for Node Blacklisting.

High level description of node blacklisting, from IMPALA-8339:
{quote}
This patch adds the concept of a blacklist of executors to the
coordinator, which removes executors from consideration for query
scheduling. Blacklisting decisions are local to a given coordinator
and are not included in statestore updates.

The intention is to allow coordinators to be more aggressive about
deciding that an executor is unhealthy or unavailable, to minimize
failed queries in environments where cluster membership may be more
variable, rather than having to wait on the statestore heartbeat
mechanism to decide that the executor is down.
{quote}






[jira] [Commented] (IMPALA-9287) test_kudu_table_create_without_hms fails on Hive-3 environment

2020-01-16 Thread Vihang Karajgaonkar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017422#comment-17017422
 ] 

Vihang Karajgaonkar commented on IMPALA-9287:
-

Thanks [~skyyws]. Let me know if you need any help.

> test_kudu_table_create_without_hms fails on Hive-3 environment
> --
>
> Key: IMPALA-9287
> URL: https://issues.apache.org/jira/browse/IMPALA-9287
> Project: IMPALA
>  Issue Type: Test
>  Components: Infrastructure
>Affects Versions: Impala 3.4.0
>Reporter: Vihang Karajgaonkar
>Assignee: WangSheng
>Priority: Blocker
>  Labels: broken-build
>
> {{test_kudu_table_create_without_hms}}, which was added recently in 
> IMPALA-9266, fails when Hive-3 is used. To reproduce the issue, build Impala 
> after setting {{USE_CDP_HIVE=true}} and then run the test.






[jira] [Comment Edited] (IMPALA-9281) Inferred predicates not assigned to scan nodes when views are involved

2020-01-16 Thread Fang-Yu Rao (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017296#comment-17017296
 ] 

Fang-Yu Rao edited comment on IMPALA-9281 at 1/16/20 4:55 PM:
--

After some initial investigation, I found that, according to the log file 
{{impalad.INFO}}, the for-loop at 
[https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Analyzer.java#L1754-L1780]
 in the method {{getBoundPredicates()}} of {{Analyzer.java}} did not create an 
inferred predicate for the following query. This is Query 1 described in the 
problem description above. Note that we need to change the log level of 
{{org.apache.impala.analysis.Analyzer}} to {{TRACE}} at 
[http://localhost:25000/log_level].
{code:java}
select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.table_source = 'ONE' 
and a.table_source = b.table_source_a;
{code}
On the other hand, we could find some inferred predicates after executing the 
following query.
{code:java}
select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.c2 = b.c2a 
and a.c2 = 'one';
{code}
Specifically, we could find the following line in {{impalad.INFO}}. It turns 
out an inferred predicate {{pta1.c2a = 'one'}} was generated.
{code:java}
I0116 08:32:04.089465 21718 Analyzer.java:1750] 
ac4181dbf41da68e:c53b028e] new pred: default.pta1.c2a = 'one' 
BinaryPredicate{op==, SlotRef{label=default.pta1.c2a, path=c2a, type=STRING, 
id=11} StringLiteral{value=one}, isInferred=true}
{code}
According to 
[https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1422]
 and 
[https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1428],
 the list of inferred predicates {{conjuncts}} will later be fed into 
{{Analyzer.createEquivConjuncts()}}. If the inferred predicates are not 
correctly generated, it seems that {{Analyzer.createEquivConjuncts()}} will 
not produce a plan that takes into account the inferred predicates we 
expected, e.g. "{{b.table_source_a = 'ONE'}}".

The only difference between those 2 queries above is that the column involved 
in the first query, i.e., {{a.table_source}}, is a constant-valued column, 
whereas the column involved in the second query is not. Hence, we may need to 
figure out how the planner performs predicate inference under these 2 scenarios.




was (Author: fangyurao):
After some initial investigation, I found that, according to the log file 
{{impalad.INFO}}, the for-loop at 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Analyzer.java#L1754-L1780
 in the method {{getBoundPredicates()}} of {{Analyzer.java}} did not create an 
inferred predicate for the following query. This is Query 1 described in the 
problem description above.

{code:java}
select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.table_source = 'ONE' 
and a.table_source = b.table_source_a;
{code}

On the other hand, we could find some inferred predicates after executing the 
following query.
{code:java}
select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.c2 = b.c2a 
and a.c2 = 'one';
{code}
Specifically, we could find the following line in {{impalad.INFO}}. It turns 
out an inferred predicate {{pta1.c2a = 'one'}} was generated.
{code:java}
I0116 08:32:04.089465 21718 Analyzer.java:1750] 
ac4181dbf41da68e:c53b028e] new pred: default.pta1.c2a = 'one' 
BinaryPredicate{op==, SlotRef{label=default.pta1.c2a, path=c2a, type=STRING, 
id=11} StringLiteral{value=one}, isInferred=true}
{code}

According to 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1422
 and 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1428,
 the list of inferred predicates {{conjuncts}} will later be fed into 
{{Analyzer.createEquivConjuncts()}}. If the inferred predicates are not 
correctly generated, it seems that {{Analyzer.createEquivConjuncts()}} will 
not produce a plan that takes into account the inferred predicates.


> Inferred predicates not assigned to scan nodes when views are involved
> --
>
> Key: IMPALA-9281
> URL: https://issues.apache.org/jira/browse/IMPALA-9281
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.4.0
>Reporter: Fang-Yu Rao
>Assignee: Fang-Yu Rao
>Priority: Major
> Attachments: profile_query_1_parquet.txt, profile_query_2_parquet.txt
>
>

[jira] [Commented] (IMPALA-9281) Inferred predicates not assigned to scan nodes when views are involved

2020-01-16 Thread Fang-Yu Rao (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017296#comment-17017296
 ] 

Fang-Yu Rao commented on IMPALA-9281:
-

After some initial investigation, I found that, according to the log file 
{{impalad.INFO}}, the for-loop at 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Analyzer.java#L1754-L1780
 in the method {{getBoundPredicates()}} of {{Analyzer.java}} did not create an 
inferred predicate for the following query. This is Query 1 described in the 
problem description above.

{code:java}
select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.table_source = 'ONE' 
and a.table_source = b.table_source_a;
{code}

On the other hand, we could find some inferred predicates after executing the 
following query.
{code:java}
select * 
from default.myview_1_on_2_parquet_tables a, myview_2_on_2_parquet_tables b 
where a.c2 = b.c2a 
and a.c2 = 'one';
{code}
Specifically, we could find the following line in {{impalad.INFO}}. It turns 
out an inferred predicate {{pta1.c2a = 'one'}} was generated.
{code:java}
I0116 08:32:04.089465 21718 Analyzer.java:1750] 
ac4181dbf41da68e:c53b028e] new pred: default.pta1.c2a = 'one' 
BinaryPredicate{op==, SlotRef{label=default.pta1.c2a, path=c2a, type=STRING, 
id=11} StringLiteral{value=one}, isInferred=true}
{code}

According to 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1422
 and 
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L1428,
 the list of inferred predicates {{conjuncts}} will later be fed into 
{{Analyzer.createEquivConjuncts()}}. If the inferred predicates are not 
correctly generated, it seems that {{Analyzer.createEquivConjuncts()}} will 
not produce a plan that takes into account the inferred predicates.


> Inferred predicates not assigned to scan nodes when views are involved
> --
>
> Key: IMPALA-9281
> URL: https://issues.apache.org/jira/browse/IMPALA-9281
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.4.0
>Reporter: Fang-Yu Rao
>Assignee: Fang-Yu Rao
>Priority: Major
> Attachments: profile_query_1_parquet.txt, profile_query_2_parquet.txt
>
>
> When a query involves a join of views, each created from multiple tables, 
> the inferred predicates are not assigned to the scan nodes. 
> This issue seems related to 
> https://issues.apache.org/jira/browse/IMPALA-4578.
> The following is a minimal example to reproduce the issue.
> {code:java}
> CREATE TABLE default.pt1 (
>   c1 INT,
>   c2 STRING
> )
> STORED AS PARQUET;
> insert into pt1 values (1, 'one');
> CREATE TABLE default.pt2 (
>   c1 INT,
>   c2 STRING
> )
> STORED AS PARQUET;
> insert into pt2 values (2, 'two');
> CREATE TABLE default.pta1 (
>   c1a INT,
>   c2a STRING
> )
> STORED AS PARQUET;
> insert into pta1 values (1,'one');
> CREATE TABLE default.pta2 (
>   c1a INT,
>   c2a STRING
> )
> STORED AS PARQUET;
> insert into pta2 values (2,'two');
> CREATE VIEW myview_1_on_2_parquet_tables AS 
> SELECT 'ONE' table_source, c1, c2 FROM `default`.pt1 
> UNION ALL 
> SELECT 'TWO' table_source, c1, c2 FROM `default`.pt2;
> CREATE VIEW myview_2_on_2_parquet_tables AS  
> SELECT 'ONE' table_source_a, c1a, c2a FROM `default`.pta1 
> UNION ALL 
> SELECT 'TWO' table_source_a, c1a, c2a FROM `default`.pta2;
> {code}
> For easy reference, the contents of tables {{pt1}}, {{pt2}}, {{pta1}}, 
> {{pta2}}, and views {{myview_1_on_2_parquet_tables}}, 
> {{myview_2_on_2_parquet_tables}} are also given as follows.
> Contents of table {{pt1}} afterwards:
> {code:java}
> +----+-----+
> | c1 | c2  |
> +----+-----+
> | 1  | one |
> +----+-----+
> {code}
> Contents of table {{pt2}} afterwards:
> {code:java}
> +----+-----+
> | c1 | c2  |
> +----+-----+
> | 2  | two |
> +----+-----+
> {code}
> Contents of table {{pta1}} afterwards:
> {code:java}
> +-----+-----+
> | c1a | c2a |
> +-----+-----+
> | 1   | one |
> +-----+-----+
> {code}
> Contents of table {{pta2}} afterwards:
> {code:java}
> +-----+-----+
> | c1a | c2a |
> +-----+-----+
> | 2   | two |
> +-----+-----+
> {code}
> Contents in {{myview_1_on_2_parquet_tables}} (union of tables {{pt1}} and 
> {{pt2}}):
> {code:java}
> +--------------+----+-----+
> | table_source | c1 | c2  |
> +--------------+----+-----+
> | ONE          | 1  | one |
> | TWO          | 2  | two |
> +--------------+----+-----+
> {code}
> Contents in {{myview_2_on_2_parquet_tables}} (union of tables {{pta1}} and 
> {{pta2}}):
> {code:java}
> +----------------+-----+-----+
> | table_source_a | c1a | c2a |
>