[jira] [Created] (IMPALA-9641) Query hang when containing alias names as empty backticks

2020-04-09 Thread Quanlong Huang (Jira)
Quanlong Huang created IMPALA-9641:
--

 Summary: Query hang when containing alias names as empty backticks
 Key: IMPALA-9641
 URL: https://issues.apache.org/jira/browse/IMPALA-9641
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Reporter: Quanlong Huang


The following query will hang in an infinite loop:
{code:java}
select 1 as "``";
{code}
Stacktrace of its compiler thread:
{code:java}
"Thread-19" #34 prio=5 os_prio=0 tid=0x12fc nid=0x5514 runnable 
[0x7f2abda41000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
- locked <0x0005cc90f7b8> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(PrintStream.java:482)
- locked <0x0005cc90f798> (a java.io.PrintStream)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.flushBuffer(StreamEncoder.java:104)
- locked <0x0005cc90f8d8> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flushBuffer(OutputStreamWriter.java:185)
at java.io.PrintStream.write(PrintStream.java:527)
- locked <0x0005cc90f798> (a java.io.PrintStream)
at java.io.PrintStream.print(PrintStream.java:669)
at java.io.PrintStream.println(PrintStream.java:806)
- locked <0x0005cc90f798> (a java.io.PrintStream)
at org.antlr.runtime.BaseRecognizer.emitErrorMessage(BaseRecognizer.java:344)
at org.antlr.runtime.BaseRecognizer.displayRecognitionError(BaseRecognizer.java:194)
at org.antlr.runtime.Lexer.reportError(Lexer.java:261)
at org.antlr.runtime.Lexer.nextToken(Lexer.java:103)
at org.apache.impala.analysis.ToSqlUtils.hiveNeedsQuotes(ToSqlUtils.java:145)
at org.apache.impala.analysis.ToSqlUtils.getIdentSql(ToSqlUtils.java:199)
at org.apache.impala.analysis.SlotRef.<init>(SlotRef.java:58)
at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyzeSelectClause(SelectStmt.java:283)
at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:215)
at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:199)
at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:192)
at org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:473)
at org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:437)
at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1530)
at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1497)
at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1467)
at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:154)
{code}
org.antlr.runtime.Lexer keeps emitting the same error message to stderr (which 
is redirected to impalad.ERROR):
{code:java}
line 1:0 rule Identifier failed predicate: {allowQuotedId()}?
line 1:0 rule Identifier failed predicate: {allowQuotedId()}?
line 1:0 rule Identifier failed predicate: {allowQuotedId()}?
{code}
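
The repeated message suggests the lexer reports an error at position 1:0 and is then asked for the next token without having consumed any input. A minimal Python sketch of that failure mode and one possible guard follows. It is not the code in ToSqlUtils or the ANTLR runtime; the names needs_quotes and IDENT are hypothetical, and the identifier rule is a simplified assumption.

```python
import re

# Hypothetical rule for an unquoted identifier (assumption, not Impala's grammar).
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def needs_quotes(ident: str) -> bool:
    """Return True when `ident` is not exactly one plain identifier token."""
    pos = 0
    tokens = 0
    while pos < len(ident):
        match = IDENT.match(ident, pos)
        if match is None:
            # Lexing error: stop here instead of requesting the next token
            # again at the same position, which would never terminate.
            return True
        pos = match.end()  # every successful match consumes at least one char
        tokens += 1
    return tokens != 1
```

With this shape, an input such as two empty backticks returns on the first error instead of looping, because the loop either advances pos or exits.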



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Comment Edited] (IMPALA-9621) Support iceberg on hdfs

2020-04-09 Thread WangSheng (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080176#comment-17080176
 ] 

WangSheng edited comment on IMPALA-9621 at 4/10/20, 2:38 AM:
-

[~tarmstrong] Hi Tim, thanks again for your reply. Do you mean sharing the code 
of HdfsScanNode and treating iceberg as another HdfsTable? Or just implementing 
an icebergTable with HdfsScanNode? 
At first we planned to implement this by treating iceberg as ICEBERG_PARQUET, 
just like HUDI_PARQUET. But after reading the iceberg source code, we found that 
its metadata structure is different from impala's: iceberg manages metadata 
itself by referring to an hdfs location. Even if we can use the HiveCatalog api, 
we cannot read iceberg data on hdfs directly, because it is not laid out like a 
normal hdfs table structure: hdfs://xxx/db/table/partition=xxx/xxx. 
As you mentioned above, a lot of it might be very different, so I will study 
the iceberg code more deeply to see if I can find a better way. I hope to get 
more advice from you, thanks!


> Support iceberg on hdfs
> ---
>
> Key: IMPALA-9621
> URL: https://issues.apache.org/jira/browse/IMPALA-9621
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: WangSheng
>Assignee: WangSheng
>Priority: Major
>
> We have been investigating iceberg recently and are preparing to implement 
> selecting iceberg data through impala. Our production uses hdfs, so we will 
> try to support iceberg on hdfs.






[jira] [Commented] (IMPALA-9621) Support iceberg on hdfs

2020-04-09 Thread WangSheng (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080176#comment-17080176
 ] 

WangSheng commented on IMPALA-9621:
---

[~tarmstrong] Hi Tim, thanks again for your reply. Do you mean sharing the code 
of HdfsScanNode and treating iceberg as another HdfsTable? Or just implementing 
an icebergTable with HdfsScanNode? At first we planned to implement this by 
treating iceberg as ICEBERG_PARQUET, just like HUDI_PARQUET. But after reading 
the iceberg source code, we found that its metadata structure is different from 
impala's: iceberg manages metadata itself by referring to an hdfs location. Even 
if we can use the HiveCatalog api, we cannot read iceberg data on hdfs directly, 
because it is not laid out like a normal hdfs table structure: 
hdfs://xxx/db/table/partition=xxx/xxx. As you mentioned above, a lot of it might 
be very different, so I will study the iceberg code more deeply to see if I can 
find a better way. I hope to get more advice from you, thanks!

> Support iceberg on hdfs
> ---
>
> Key: IMPALA-9621
> URL: https://issues.apache.org/jira/browse/IMPALA-9621
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: WangSheng
>Assignee: WangSheng
>Priority: Major
>
> We have been investigating iceberg recently and are preparing to implement 
> selecting iceberg data through impala. Our production uses hdfs, so we will 
> try to support iceberg on hdfs.






[jira] [Commented] (IMPALA-9620) Predicates in the SELECT and GROUP-BY cause failure with CNF rewrite enabled

2020-04-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080164#comment-17080164
 ] 

ASF subversion and git services commented on IMPALA-9620:
-

Commit 293dc2ec92d0cadf1d3803a22e66e762d0ff6cf1 in impala's branch 
refs/heads/master from Aman Sinha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=293dc2e ]

IMPALA-9620: Ensure group-by and cnf exprs are analyzed

This change initializes the SelectStmt's groupingExprs_
with the analyzed version. It also analyzes the new predicates
created by the Conjunctive Normal Form rewrite rule such that
potential consumers of this rewrite don't encounter problems.

Before this change, the SelectStmt.analyzeGroupingExprs() made
a deep copy of the original grouping exprs, then analyzed the
copy but left the original intact. This causes problems because
a rewrite rule (invoked by SelectStmt.rewriteExprs()) may try to
process the original grouping exprs and encounter INVALID_TYPE
(types are only assigned after analyze). This was the root cause
of the problem described in the JIRA. Although this was a pre-
existing behavior, it gets exposed when enable_cnf_rewrites=true.
Note that the deep-copied analyzed grouping exprs are supplied
to MultiAggregateInfo and since many operations are using
this data structure, we don't see widespread issues.

This patch fixes it and as a conservative measure, does the
analyze of new predicates in the CNF rule. (note: there are likely
other rewrite rules where explicit analyze should be done but
that is outside the scope for this issue).

Testing:
 - Added new unit tests with predicates in SELECT and GROUP BY
 - Ran 'mvn test' for the FE

Change-Id: I6da4a17c6e648f466ce118c4646520ff68f9878e
Reviewed-on: http://gerrit.cloudera.org:8080/15693
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
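
The root cause described in the commit message (a deep copy is analyzed while the original grouping exprs stay unanalyzed) can be pictured with a small model. This is only an illustrative Python sketch, not the Java code in SelectStmt; the classes Expr and Stmt and both functions are hypothetical, and a type of None stands in for INVALID_TYPE.

```python
import copy


class Expr:
    """Stand-in for an expression; type None models INVALID_TYPE."""

    def __init__(self, sql):
        self.sql = sql
        self.type = None

    def analyze(self):
        # Placeholder: real analysis would resolve the actual type.
        self.type = "BOOLEAN"


class Stmt:
    def __init__(self, grouping_exprs):
        self.grouping_exprs = grouping_exprs


def analyze_grouping_old(stmt):
    """Old behavior: analyze a deep copy and leave the originals untouched."""
    analyzed = copy.deepcopy(stmt.grouping_exprs)
    for expr in analyzed:
        expr.analyze()
    return analyzed


def analyze_grouping_new(stmt):
    """New behavior: also store the analyzed copy back on the statement."""
    analyzed = analyze_grouping_old(stmt)
    stmt.grouping_exprs = analyzed
    return analyzed
```

In the old variant, anything that later walks stmt.grouping_exprs (such as a rewrite rule) still sees exprs without a type; in the new variant it sees the analyzed ones.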


> Predicates in the SELECT and GROUP-BY cause failure with CNF rewrite enabled
> 
>
> Key: IMPALA-9620
> URL: https://issues.apache.org/jira/browse/IMPALA-9620
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.4.0
>Reporter: Aman Sinha
>Assignee: Aman Sinha
>Priority: Major
>
> Predicates can appear in the SELECT and GROUP BY list as part of IF(), CASE() 
> clauses.
> When enable_cnf_rewrites is set to true, such queries encounter failure 
> during planning as shown below. Queries run successfully when the flag is 
> disabled. 
> Note that the predicate does not have to be a disjunctive predicate for this 
> failure to occur; other types of predicates reproduce the issue as well. 
> {noformat}
> set enable_cnf_rewrites = true;
> select l_quantity, if(l_quantity < 5 or l_quantity > 45, 'invalid', 'valid') 
> from lineitem group by l_quantity, if(l_quantity < 5 or l_quantity > 45, 
> 'invalid', 'valid') limit 5
> ERROR: IllegalStateException: null
> {noformat}
> Stack trace:
> {noformat}
> I0407 17:40:40.306650 31240 jni-util.cc:288] 2741e90d2edac592:c625a35f] java.lang.IllegalStateException
> at com.google.common.base.Preconditions.checkState(Preconditions.java:492)
> at org.apache.impala.analysis.SlotRef.getIdsHelper(SlotRef.java:229)
> at org.apache.impala.analysis.Expr.getIdsHelper(Expr.java:1286)
> at org.apache.impala.analysis.Expr.getIdsHelper(Expr.java:1286)
> at org.apache.impala.analysis.Expr.getIds(Expr.java:1279)
> at org.apache.impala.rewrite.ConvertToCNFRule.convertToCNF(ConvertToCNFRule.java:111)
> at org.apache.impala.rewrite.ConvertToCNFRule.apply(ConvertToCNFRule.java:86)
> at org.apache.impala.rewrite.ExprRewriter.applyRuleBottomUp(ExprRewriter.java:85)
> at org.apache.impala.rewrite.ExprRewriter.applyRuleBottomUp(ExprRewriter.java:83)
> at org.apache.impala.rewrite.ExprRewriter.applyRuleRepeatedly(ExprRewriter.java:71)
> at org.apache.impala.rewrite.ExprRewriter.rewrite(ExprRewriter.java:55)
> at org.apache.impala.analysis.SelectStmt.rewriteCheckOrdinalResult(SelectStmt.java:1043)
> at org.apache.impala.analysis.SelectStmt.rewriteExprs(SelectStmt.java:1068)
> at org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:472)
> at org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:415)
> at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1530)
> at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1497)
> {noformat}
> Other variations of the same pattern that also encounter similar failure:
> {noformat}
> explain select case when not (l_quantity = 5) then 0 else 1 end from lineitem 
> group by case when not (l_quantity = 

[jira] [Resolved] (IMPALA-9619) [DOC]: Document Impala support for Kudu DATE and VARCHAR columns

2020-04-09 Thread Kris Hahn (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Hahn resolved IMPALA-9619.
---
Resolution: Duplicate

created separate jiras for DATE and VARCHAR

> [DOC]: Document Impala support for Kudu DATE and VARCHAR columns
> 
>
> Key: IMPALA-9619
> URL: https://issues.apache.org/jira/browse/IMPALA-9619
> Project: IMPALA
>  Issue Type: Documentation
>  Components: Docs
>Reporter: Kris Hahn
>Assignee: Kris Hahn
>Priority: Major
>  Labels: Doc
>
> Search and update mentions of Kudu lack of support for DATE and VARCHAR 
> columns. See IMPALA-5092.






[jira] [Commented] (IMPALA-9640) [DOC]: Document Impala support for Kudu VARCHAR type

2020-04-09 Thread Kris Hahn (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080120#comment-17080120
 ] 

Kris Hahn commented on IMPALA-9640:
---

This documentation change consists of removing VARCHAR from the list of 
unsupported types in the Kudu considerations section in the shared common file.

> [DOC]: Document Impala support for Kudu VARCHAR type
> 
>
> Key: IMPALA-9640
> URL: https://issues.apache.org/jira/browse/IMPALA-9640
> Project: IMPALA
>  Issue Type: Documentation
>  Components: Docs
>Affects Versions: Not Applicable
>Reporter: Kris Hahn
>Assignee: Kris Hahn
>Priority: Major
>  Labels: docs
> Fix For: Not Applicable
>
>
> See IMPALA-5092. Kudu applies the length of a VARCHAR as a character length, 
> whereas Impala currently uses a byte length. Kudu tuples containing VARCHAR 
> columns therefore use characters instead of bytes to limit the length. For 
> ASCII values there is no difference. However, if multi-byte characters are 
> written to Kudu, the length could be longer than allowed.
> Impala checks the actual length and truncates the value if necessary.






[jira] [Created] (IMPALA-9640) [DOC]: Document Impala support for Kudu VARCHAR type

2020-04-09 Thread Kris Hahn (Jira)
Kris Hahn created IMPALA-9640:
-

 Summary: [DOC]: Document Impala support for Kudu VARCHAR type
 Key: IMPALA-9640
 URL: https://issues.apache.org/jira/browse/IMPALA-9640
 Project: IMPALA
  Issue Type: Documentation
  Components: Docs
Affects Versions: Not Applicable
Reporter: Kris Hahn
Assignee: Kris Hahn
 Fix For: Not Applicable


See IMPALA-5092. Kudu applies the length of a VARCHAR as a character length, 
whereas Impala currently uses a byte length. Kudu tuples containing VARCHAR 
columns therefore use characters instead of bytes to limit the length. For 
ASCII values there is no difference. However, if multi-byte characters are 
written to Kudu, the length could be longer than allowed.
Impala checks the actual length and truncates the value if necessary.
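
For the docs, the character-versus-byte distinction might be illustrated roughly as follows. This is a Python sketch under the assumption of UTF-8 encoding; the helper names are hypothetical and it is not how Impala or Kudu implement the check.

```python
def char_length(value: str) -> int:
    """Length in characters (Unicode code points)."""
    return len(value)


def byte_length(value: str) -> int:
    """Length in bytes, assuming UTF-8 encoding."""
    return len(value.encode("utf-8"))


def truncate_to_chars(value: str, max_chars: int) -> str:
    """Keep at most max_chars characters, regardless of their byte width."""
    return value[:max_chars]


ascii_value = "hello"
multibyte_value = "h\u00e9llo"  # the second character takes two bytes in UTF-8

# ASCII: both lengths agree. Multi-byte: the byte length is larger.
print(char_length(ascii_value), byte_length(ascii_value))
print(char_length(multibyte_value), byte_length(multibyte_value))
```

A value that fits a character limit can therefore exceed a byte limit of the same size once it contains multi-byte characters.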






[jira] [Commented] (IMPALA-5746) Remote fragments continue to hold onto memory after stopping the coordinator daemon

2020-04-09 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080083#comment-17080083
 ] 

Sahil Takiar commented on IMPALA-5746:
--

Positing some ideas about how to fix this before I forget:
 * Decrease the 10 minute timeout to a lower value
 ** Perhaps we should do this regardless, 10 minutes seems like a long time to 
wait to detect a failure
 ** I guess the original intention of this flag is to handle the case where a 
backend is still part of the cluster membership, but sending status reports is 
failing for some other reason, so the timeout should be longer than the 
statestore heartbeat timeout (I think this is ~10 seconds by default)
 ** Not sure what the exact value should be, but I'm going to throw out 1 
minute as a suggestion - if a query can't report its status for over a minute, 
then something is probably really wrong with that impalad
 ** Maybe we should integrate this with node blacklisting? I haven't looked at 
this patch in detail, but it looks like we cancel all queries running on a 
backend that is "unresponsive", but do we prevent future queries from running 
on the fragment?
 * Make executors listen to the cluster membership topic and have the 
QueryExecMgr cancel all fragments running on a coordinator that has left the 
cluster membership - I think we already do something similar inside the 
Coordinator: 
[https://github.com/apache/impala/blob/5989900ae81a98d6977bdd60f2281da47e9f69b7/be/src/runtime/exec-env.cc#L555]
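
To make the second idea concrete, here is a rough Python sketch of the selection step an executor could run on each cluster membership update. It only models the logic; the function name, the dict of running queries, and the addresses are hypothetical, and the real change would live in the C++ backend.

```python
def queries_to_cancel(running_queries, live_members):
    """Return ids of queries whose coordinator is no longer in the membership.

    running_queries: dict mapping query id -> coordinator address (hypothetical).
    live_members: set of addresses from the latest membership update.
    """
    return sorted(
        query_id
        for query_id, coordinator in running_queries.items()
        if coordinator not in live_members
    )


# Example: the coordinator on port 27000 has left the cluster membership.
running = {"q1": "127.0.0.1:27000", "q2": "127.0.0.1:27001"}
members = {"127.0.0.1:27001", "127.0.0.1:27002"}
print(queries_to_cancel(running, members))
```

The executor would then cancel the fragments of every returned query instead of waiting for status reports to time out.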

> Remote fragments continue to hold onto memory after stopping the coordinator 
> daemon
> ---
>
> Key: IMPALA-5746
> URL: https://issues.apache.org/jira/browse/IMPALA-5746
> Project: IMPALA
>  Issue Type: Bug
>  Components: Distributed Exec
>Affects Versions: Impala 2.10.0
>Reporter: Mostafa Mokhtar
>Assignee: Sahil Takiar
>Priority: Critical
> Attachments: remote_fragments_holding_memory.txt
>
>
> Repro 
> # Start running queries 
> # Kill the coordinator node 
> # On the running Impalad check the memz tab, remote fragments continue to run 
> and hold on to resources
> Remote fragments held on to memory 30+ minutes after stopping the coordinator 
> service. 
> Attached thread dump from an Impalad running remote fragments.
> Snapshot of memz tab 30 minutes after killing the coordinator
> {code}
> Process: Limit=201.73 GB Total=5.32 GB Peak=179.36 GB
>   Free Disk IO Buffers: Total=1.87 GB Peak=1.87 GB
>   RequestPool=root.default: Total=1.35 GB Peak=178.51 GB
> Query(f64169d4bb3c901c:3a21d8ae): Total=2.64 MB Peak=104.73 MB
>   Fragment f64169d4bb3c901c:3a21d8ae0051: Total=2.64 MB Peak=2.67 MB
> AGGREGATION_NODE (id=15): Total=2.54 MB Peak=2.57 MB
>   Exprs: Total=30.12 KB Peak=30.12 KB
> EXCHANGE_NODE (id=14): Total=0 Peak=0
> DataStreamRecvr: Total=0 Peak=12.29 KB
> DataStreamSender (dst_id=17): Total=85.31 KB Peak=85.31 KB
> CodeGen: Total=1.53 KB Peak=374.50 KB
>   Block Manager: Limit=161.39 GB Total=512.00 KB Peak=1.54 MB
> Query(2a4f12b3b4b1dc8c:db7e8cf2): Total=258.29 MB Peak=412.98 MB
>   Fragment 2a4f12b3b4b1dc8c:db7e8cf2008c: Total=2.29 MB Peak=2.29 MB
> SORT_NODE (id=11): Total=4.00 KB Peak=4.00 KB
> AGGREGATION_NODE (id=20): Total=2.27 MB Peak=2.27 MB
>   Exprs: Total=25.12 KB Peak=25.12 KB
> EXCHANGE_NODE (id=19): Total=0 Peak=0
> DataStreamRecvr: Total=0 Peak=0
> DataStreamSender (dst_id=21): Total=3.88 KB Peak=3.88 KB
> CodeGen: Total=4.17 KB Peak=1.05 MB
>   Block Manager: Limit=161.39 GB Total=256.25 MB Peak=321.66 MB
> Query(68421d2a5dea0775:83f5d972): Total=282.77 MB Peak=443.53 MB
>   Fragment 68421d2a5dea0775:83f5d972004a: Total=26.77 MB Peak=26.92 MB
> SORT_NODE (id=8): Total=8.00 KB Peak=8.00 KB
>   Exprs: Total=4.00 KB Peak=4.00 KB
> ANALYTIC_EVAL_NODE (id=7): Total=4.00 KB Peak=4.00 KB
>   Exprs: Total=4.00 KB Peak=4.00 KB
> SORT_NODE (id=6): Total=24.00 MB Peak=24.00 MB
> AGGREGATION_NODE (id=12): Total=2.72 MB Peak=2.83 MB
>   Exprs: Total=85.12 KB Peak=85.12 KB
> EXCHANGE_NODE (id=11): Total=0 Peak=0
> DataStreamRecvr: Total=0 Peak=84.80 KB
> DataStreamSender (dst_id=13): Total=1.27 KB Peak=1.27 KB
> CodeGen: Total=24.80 KB Peak=4.13 MB
>   Block Manager: Limit=161.39 GB Total=280.50 MB Peak=286.52 MB
> Query(e94c89fa89a74d27:82812bf9): Total=258.29 MB Peak=436.85 MB
>   Fragment e94c89fa89a74d27:82812bf9008e: Total=2.29 MB Peak=2.29 MB
> SORT_NODE (id=11): Total=4.00 KB Peak=4.00 KB
> AGGREGATION_NODE (id=20): Total=2.27 MB Peak=2.27 MB
>   Exprs: 

[jira] [Commented] (IMPALA-5746) Remote fragments continue to hold onto memory after stopping the coordinator daemon

2020-04-09 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080076#comment-17080076
 ] 

Sahil Takiar commented on IMPALA-5746:
--

[~twmarshall] and I discussed this a bit on the review for the test-case 
[https://gerrit.cloudera.org/#/c/15666/] but moving the conversation here.

So, originally I thought IMPALA-2990 fixed this, but unfortunately it looks like 
the situation is more complicated. There is at least one situation where 
killing a coordinator does not cause executors to kill any orphaned fragments. 
The fragments only get killed after the report status RPC fails for 10 minutes.
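
As a simplified model of that behavior (a Python sketch, not the actual logic in query-state.cc; the function and constant names are hypothetical), the executor gives up only once the time spent retrying the status report reaches a maximum:

```python
TEN_MINUTES_MS = 10 * 60 * 1000  # the timeout described above
ONE_MINUTE_MS = 60 * 1000        # a shorter value, used here for comparison only


def should_cancel(retry_elapsed_ms: int, max_retry_ms: int) -> bool:
    """Cancel once status reports have been failing for at least max_retry_ms."""
    return retry_elapsed_ms >= max_retry_ms


# 220034 ms is the "Time spent retrying" value from the executor log in this comment.
print(should_cancel(220034, TEN_MINUTES_MS))
print(should_cancel(220034, ONE_MINUTE_MS))
```

With the 10 minute limit the fragments are still alive at that point in the log, while a 1 minute limit would already have cancelled them.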

I ran the following query:
{code:java}
select * from tpch.lineitem t1, tpch.lineitem t2, tpch.lineitem t3 where 
t1.l_orderkey = t2.l_orderkey and t1.l_orderkey = t3.l_orderkey and 
t3.l_orderkey = t2.l_orderkey order by t1.l_orderkey, t2.l_orderkey, 
t3.l_orderkey limit 100;
{code}
On a cluster started via {{./bin/start-impala-cluster.py}} (oddly, it looks like 
if I use a slightly different cluster topology, things are a bit different - so 
perhaps there is a race condition somewhere).

Waited for the query to run for a bit (progress bar said it was about 50% 
complete). Killed the coordinator, waited for a bit, and then looked at the 
/memz page for one of the executors, which showed this:
{code:java}
Process: Limit=7.28 GB Total=1.27 GB Peak=1.44 GB
  Buffer Pool: Free Buffers: Total=0
  Buffer Pool: Clean Pages: Total=0
  Buffer Pool: Unused Reservation: Total=-18.30 MB
  Control Service Queue: Limit=50.00 MB Total=0 Peak=15.24 KB
  Data Stream Service Queue: Limit=372.92 MB Total=0 Peak=2.01 MB
  Data Stream Manager Early RPCs: Total=0 Peak=0
  TCMalloc Overhead: Total=30.94 MB
  RequestPool=default-pool: Total=1.12 GB Peak=1.18 GB
Query(3e42b7e4a9f9b58b:72759e5d): Reservation=1.10 GB 
ReservationLimit=5.83 GB OtherMemory=17.17 MB Total=1.12 GB Peak=1.18 GB
  Runtime Filter Bank: Reservation=10.00 MB ReservationLimit=10.00 MB 
OtherMemory=0 Total=10.00 MB Peak=10.00 MB
  Fragment 3e42b7e4a9f9b58b:72759e5d0008: Reservation=0 OtherMemory=0 
Total=0 Peak=65.57 MB
HDFS_SCAN_NODE (id=2): Reservation=0 OtherMemory=0 Total=0 Peak=65.42 MB
KrpcDataStreamSender (dst_id=8): Total=0 Peak=150.41 KB
  Fragment 3e42b7e4a9f9b58b:72759e5d0005: Reservation=0 OtherMemory=0 
Total=0 Peak=65.57 MB
HDFS_SCAN_NODE (id=1): Reservation=0 OtherMemory=0 Total=0 Peak=65.42 MB
KrpcDataStreamSender (dst_id=7): Total=0 Peak=150.41 KB
  Fragment 3e42b7e4a9f9b58b:72759e5d0002: Reservation=0 OtherMemory=0 
Total=0 Peak=65.91 MB
HDFS_SCAN_NODE (id=0): Reservation=0 OtherMemory=0 Total=0 Peak=65.91 MB
KrpcDataStreamSender (dst_id=6): Total=0 Peak=150.41 KB
  Fragment 3e42b7e4a9f9b58b:72759e5d000b: Reservation=1.09 GB 
OtherMemory=17.06 MB Total=1.11 GB Peak=1.11 GB
SORT_NODE (id=5): Total=148.00 KB Peak=148.00 KB
HASH_JOIN_NODE (id=4): Reservation=558.00 MB OtherMemory=42.25 KB 
Total=558.04 MB Peak=558.06 MB
  Exprs: Total=13.12 KB Peak=13.12 KB
  Hash Join Builder (join_node_id=4): Total=13.12 KB Peak=21.12 KB
Hash Join Builder (join_node_id=4) Exprs: Total=13.12 KB Peak=13.12 
KB
HASH_JOIN_NODE (id=3): Reservation=558.00 MB OtherMemory=34.25 KB 
Total=558.03 MB Peak=558.05 MB
  Exprs: Total=13.12 KB Peak=13.12 KB
  Hash Join Builder (join_node_id=3): Total=13.12 KB Peak=21.12 KB
Hash Join Builder (join_node_id=3) Exprs: Total=13.12 KB Peak=13.12 
KB
EXCHANGE_NODE (id=6): Reservation=16.84 MB OtherMemory=0 Total=16.84 MB 
Peak=16.85 MB
  KrpcDeferredRpcs: Total=0 Peak=37.36 KB
EXCHANGE_NODE (id=7): Reservation=0 OtherMemory=0 Total=0 Peak=2.54 MB
  KrpcDeferredRpcs: Total=0 Peak=0
EXCHANGE_NODE (id=8): Reservation=0 OtherMemory=0 Total=0 Peak=16.69 MB
  KrpcDeferredRpcs: Total=0 Peak=37.56 KB
KrpcDataStreamSender (dst_id=9): Total=272.00 B Peak=272.00 B
  CodeGen: Total=12.64 KB Peak=696.50 KB
  CodeGen: Total=12.64 KB Peak=696.50 KB
  CodeGen: Total=12.64 KB Peak=696.50 KB
  CodeGen: Total=75.92 KB Peak=5.00 MB
  Untracked Memory: Total=147.84 MB
{code}
The logs of the Impala executor show:
{code:java}
I0409 14:44:01.852174 28903 kudu-status-util.h:55] 
3e42b7e4a9f9b58b:72759e5d] ReportExecStatus() RPC failed: Network 
error: Client connection negotiation failed: client connection to 
127.0.0.1:27000: connect: Connection refused (error 111)
W0409 14:44:01.852253 28903 query-state.cc:498] 
3e42b7e4a9f9b58b:72759e5d] Failed to send ReportExecStatus() RPC for 
query 3e42b7e4a9f9b58b:72759e5d. Consecutive failed reports = 9. Time 
spent retrying = 220034ms.
I0409 14:44:04.862691  8833 krpc-data-stream-mgr.cc:422] Reduced stream ID 
cache from 3 items, to 2, eviction took: 0
I0409 

[jira] [Created] (IMPALA-9639) [DOC]:Document Impala support for Kudu DATE type

2020-04-09 Thread Kris Hahn (Jira)
Kris Hahn created IMPALA-9639:
-

 Summary: [DOC]:Document Impala support for Kudu DATE type
 Key: IMPALA-9639
 URL: https://issues.apache.org/jira/browse/IMPALA-9639
 Project: IMPALA
  Issue Type: Documentation
  Components: Docs
Affects Versions: Not Applicable
Reporter: Kris Hahn
Assignee: Kris Hahn
 Fix For: Not Applicable


See IMPALA-8800, reading and writing DATE values to Kudu tables. DATE is in the 
3.4 branch. 
Include some examples from Dev's checkin:
{noformat}
 QUERY
# create table with primary key of DATE type
create table kudu_date_key (fdatekey date primary key, val string)
stored as kudu
 RESULTS
'Table has been created.'
 ERRORS
Unpartitioned Kudu tables are inefficient for large data sizes.

 QUERY
insert into kudu_date_key values (DATE '1970-01-01', 'Unix epoch'), (DATE 
'2019-12-12', 'today')
 RUNTIME_PROFILE
NumModifiedRows: 2
NumRowErrors: 0
 LABELS
FDATEKEY,VAL
 DML_RESULTS: kudu_date_key
1970-01-01,'Unix epoch'
2019-12-12,'today'
 TYPES
DATE,STRING

 QUERY
# create table with DATE primary key partitioned by range
create table kudu_datepk_range (fdate DATE not null primary key)
partition by range (fdate)
(
  partition values < DATE '1900-01-01',
  partition DATE '1900-01-01' <= values < DATE '1970-01-01',
  partition DATE '1970-01-01' <= values < DATE '2000-01-01',
  partition DATE '2000-01-01' <= values
)
stored as kudu
 RESULTS
'Table has been created.'

 QUERY
insert into kudu_datepk_range values
  (DATE '1800-01-01'),
  (DATE '1970-01-01'),
  (DATE '2019-12-12')
 RUNTIME_PROFILE
NumModifiedRows: 3
NumRowErrors: 0
 LABELS
FDATE
 DML_RESULTS: kudu_datepk_range
1800-01-01
1970-01-01
2019-12-12
 TYPES
DATE

 QUERY
select * from kudu_datepk_range;
 RESULTS
1800-01-01
1970-01-01
2019-12-12
 TYPES
DATE
 QUERY
# Test date columns and primary key
create table describe_date_test
( date_pk date PRIMARY KEY, date_val date not null, date_null date null)
stored as kudu;
describe describe_date_test;
 LABELS
NAME,TYPE,COMMENT,PRIMARY_KEY,NULLABLE,DEFAULT_VALUE,ENCODING,COMPRESSION,BLOCK_SIZE
 RESULTS
'date_pk','date','','true','false','','AUTO_ENCODING','DEFAULT_COMPRESSION','0'
'date_val','date','','false','false','','AUTO_ENCODING','DEFAULT_COMPRESSION','0'
'date_null','date','','false','true','','AUTO_ENCODING','DEFAULT_COMPRESSION','0'
 TYPES
STRING

{noformat}
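
For the range-partitioned example above, it may help readers to see which partition each inserted DATE lands in. The Python sketch below models the four ranges from the kudu_datepk_range definition with a sorted list of lower bounds; partition_index and BOUNDS are hypothetical names, and this only illustrates the range semantics, not how Kudu stores partitions.

```python
from bisect import bisect_right
from datetime import date

# Boundaries taken from the kudu_datepk_range definition above.
BOUNDS = [date(1900, 1, 1), date(1970, 1, 1), date(2000, 1, 1)]


def partition_index(value: date) -> int:
    """Index of the range partition holding value.

    0: values < 1900-01-01
    1: 1900-01-01 <= values < 1970-01-01
    2: 1970-01-01 <= values < 2000-01-01
    3: 2000-01-01 <= values
    """
    # bisect_right places a value equal to a boundary into the range that
    # starts at that boundary, matching the inclusive lower bounds.
    return bisect_right(BOUNDS, value)


for inserted in (date(1800, 1, 1), date(1970, 1, 1), date(2019, 12, 12)):
    print(inserted, partition_index(inserted))
```

Each of the three inserted rows falls into a different partition, and the 1970-01-01 row shows that the lower bound of a range is inclusive.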






[jira] [Work started] (IMPALA-9596) TestNestedTypesNoMtDop.test_tpch_mem_limit_single_node failed

2020-04-09 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-9596 started by Tim Armstrong.
-
> TestNestedTypesNoMtDop.test_tpch_mem_limit_single_node failed
> -
>
> Key: IMPALA-9596
> URL: https://issues.apache.org/jira/browse/IMPALA-9596
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Yongzhi Chen
>Assignee: Tim Armstrong
>Priority: Blocker
>  Labels: broken-build, flaky
>
> parallel-all-tests-nightly failed with:
> query_test.test_nested_types.TestNestedTypesNoMtDop.test_tpch_mem_limit_single_node[protocol:
>  beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 
> 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 
> 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: 
> orc/def/block] (from pytest)
> Error Message
> query_test/test_nested_types.py:159: in test_tpch_mem_limit_single_node 
> new_vector, use_db='tpch_nested' + db_suffix) 
> common/impala_test_suite.py:665: in run_test_case 
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db) 
> common/impala_test_suite.py:481: in __verify_exceptions (expected_str, 
> actual_str) E   AssertionError: Unexpected exception string. Expected: 
> row_regex: .*Memory limit exceeded: Failed to allocate [0-9]+ bytes for 
> collection 'tpch_nested_.*.customer.c_orders.item.o_lineitems'.* E   Not 
> found in actual: ImpalaBeeswaxException: Query aborted:Memory limit exceeded: 
> Error occurred on backend a3c3b59bd11e:22000 by fragment 
> c3440dc607fd4a18:25e42cf5Memory left in process limit: 8.26 GBMemory 
> left in query limit: -3.94 KBQuery(c3440dc607fd4a18:25e42cf5): memory 
> limit exceeded. Limit=20.00 MB Reservation=8.00 MB ReservationLimit=24.00 MB 
> OtherMemory=12.00 MB Total=20.00 MB Peak=20.00 MB  Fragment 
> c3440dc607fd4a18:25e42cf5: Reservation=8.00 MB OtherMemory=12.00 MB 
> Total=20.00 MB Peak=20.00 MBAGGREGATION_NODE (id=6): Total=24.00 KB 
> Peak=24.00 KB  NonGroupingAggregator 0: Total=8.00 KB Peak=8.00 KB
> Exprs: Total=4.00 KB Peak=4.00 KBSUBPLAN_NODE (id=1): Total=4.81 MB 
> Peak=4.81 MBHDFS_SCAN_NODE (id=0): Reservation=8.00 MB OtherMemory=7.12 
> MB Total=15.12 MB Peak=15.12 MBNESTED_LOOP_JOIN_NODE (id=5): Total=24.00 
> KB Peak=24.00 KB  Nested Loop Join Builder: Total=8.00 KB Peak=8.00 KB
> AGGREGATION_NODE (id=4): Total=24.00 KB Peak=24.00 KB  
> NonGroupingAggregator 0: Total=16.00 KB Peak=16.00 KBExprs: 
> Total=4.00 KB Peak=4.00 KBUNNEST_NODE (id=3): Total=0 Peak=0
> SINGULAR_ROW_SRC_NODE (id=2): Total=0 Peak=0PLAN_ROOT_SINK: Total=0 
> Peak=0  CodeGen: Total=2.48 KB Peak=351.00 KB
> Stacktrace
> query_test/test_nested_types.py:159: in test_tpch_mem_limit_single_node
> new_vector, use_db='tpch_nested' + db_suffix)
> common/impala_test_suite.py:665: in run_test_case
> self.__verify_exceptions(test_section['CATCH'], str(e), use_db)
> common/impala_test_suite.py:481: in __verify_exceptions
> (expected_str, actual_str)
> E   AssertionError: Unexpected exception string. Expected: row_regex: 
> .*Memory limit exceeded: Failed to allocate [0-9]+ bytes for collection 
> 'tpch_nested_.*.customer.c_orders.item.o_lineitems'.*
> E   Not found in actual: ImpalaBeeswaxException: Query aborted:Memory limit exceeded: Error occurred on backend a3c3b59bd11e:22000 by fragment c3440dc607fd4a18:25e42cf5
> Memory left in process limit: 8.26 GB
> Memory left in query limit: -3.94 KB
> Query(c3440dc607fd4a18:25e42cf5): memory limit exceeded. Limit=20.00 MB Reservation=8.00 MB ReservationLimit=24.00 MB OtherMemory=12.00 MB Total=20.00 MB Peak=20.00 MB
>   Fragment c3440dc607fd4a18:25e42cf5: Reservation=8.00 MB OtherMemory=12.00 MB Total=20.00 MB Peak=20.00 MB
> AGGREGATION_NODE (id=6): Total=24.00 KB Peak=24.00 KB
>   NonGroupingAggregator 0: Total=8.00 KB Peak=8.00 KB
> Exprs: Total=4.00 KB Peak=4.00 KB
> SUBPLAN_NODE (id=1): Total=4.81 MB Peak=4.81 MB
> HDFS_SCAN_NODE (id=0): Reservation=8.00 MB OtherMemory=7.12 MB Total=15.12 MB Peak=15.12 MB
> NESTED_LOOP_JOIN_NODE (id=5): Total=24.00 KB Peak=24.00 KB
>   Nested Loop Join Builder: Total=8.00 KB Peak=8.00 KB
> AGGREGATION_NODE (id=4): Total=24.00 KB Peak=24.00 KB
>   NonGroupingAggregator 0: Total=16.00 KB Peak=16.00 KB
> Exprs: Total=4.00 KB Peak=4.00 KB
> UNNEST_NODE (id=3): Total=0 Peak=0
> SINGULAR_ROW_SRC_NODE (id=2): Total=0 Peak=0
> PLAN_ROOT_SINK: Total=0 Peak=0
>   CodeGen: Total=2.48 KB Peak=351.00 KB



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (IMPALA-9638) Don't create unnecessary threads in executor-only impalads

2020-04-09 Thread Sahil Takiar (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080044#comment-17080044
 ] 

Sahil Takiar commented on IMPALA-9638:
--

Noticed this while looking into IMPALA-9609, so linking the two.

> Don't create unnecessary threads in executor-only impalads
> --
>
> Key: IMPALA-9638
> URL: https://issues.apache.org/jira/browse/IMPALA-9638
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Sahil Takiar
>Priority: Major
>
> Looking through the code for ImpalaServer, it looks like there are several 
> threads / thread-pools that are started on executor-only impalads, but don't 
> actually do anything:
>  * The "cancellation-worker" threadpool (ImpalaServer::CancelFromThreadPool)
>  * The "session-maintenance" thread (ImpalaServer::SessionMaintenance)
>  * The "query-expirer" thread (ImpalaServer::ExpireQueries)
>  * The "unresponsive-backend-thread" thread 
> (ImpalaServer::UnresponsiveBackendThread)
>  * The code to start ImpalaInternalService might be dead code, but maybe we 
> should just delete it since the ImpalaInternalService doesn't exist anymore
> Confirmed this by creating a cluster with a dedicated coordinator and getting a 
> thread dump of an executor, which showed the following:
> {code:java}
> Thread 16 (Thread 0x7fe96fe57700 (LWP 8721)):
> #0  0x7fea1ad20360 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib/x86_64-linux-gnu/libpthread.so.0
> No symbol table info available.
> #1  0x01ce4423 in impala::ConditionVariable::Wait (this=0x110b3370, 
> lock=...) at /home/stakiar/Impala/be/src/util/condition-variable.h:49
> mutex = 0x110b3348
> #2  0x02469f6a in impala::ImpalaServer::SessionMaintenance 
> (this=0x110b3200) at /home/stakiar/Impala/be/src/service/impala-server.cc:2055
> timeout_lock = {_M_device = 0x110b3348, _M_owns = true}
> now = 140640581412928
> expired_cnt = 0 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-9638) Don't create unnecessary threads in executor-only impalads

2020-04-09 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9638:


 Summary: Don't create unnecessary threads in executor-only impalads
 Key: IMPALA-9638
 URL: https://issues.apache.org/jira/browse/IMPALA-9638
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Reporter: Sahil Takiar


Looking through the code for ImpalaServer, it looks like there are several 
threads / thread-pools that are started on executor-only impalads, but don't 
actually do anything:
 * The "cancellation-worker" threadpool (ImpalaServer::CancelFromThreadPool)
 * The "session-maintenance" thread (ImpalaServer::SessionMaintenance)
 * The "query-expirer" thread (ImpalaServer::ExpireQueries)
 * The "unresponsive-backend-thread" thread 
(ImpalaServer::UnresponsiveBackendThread)
 * The code to start ImpalaInternalService might be dead code, but maybe we 
should just delete it since the ImpalaInternalService doesn't exist anymore

Confirmed this by creating a cluster with a dedicated coordinator and getting a 
thread dump of an executor, which showed the following:
{code:java}
Thread 16 (Thread 0x7fe96fe57700 (LWP 8721)):
#0  0x7fea1ad20360 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x01ce4423 in impala::ConditionVariable::Wait (this=0x110b3370, 
lock=...) at /home/stakiar/Impala/be/src/util/condition-variable.h:49
mutex = 0x110b3348
#2  0x02469f6a in impala::ImpalaServer::SessionMaintenance 
(this=0x110b3200) at /home/stakiar/Impala/be/src/service/impala-server.cc:2055
timeout_lock = {_M_device = 0x110b3348, _M_owns = true}
now = 140640581412928
expired_cnt = 0 {code}






[jira] [Updated] (IMPALA-9637) Scan range load-balancing within backend

2020-04-09 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-9637:
--
Description: 
Currently the scheduler statically divides scan ranges between fragment 
instances. Since IMPALA-9015 it statically load-balances scan ranges based on 
file size using the LPT algorithm in the schedule.

This has various pitfalls:
 * It interacts badly with dynamic partition pruning, which can filter out a 
bunch of scan ranges and unbalance the load
 * Different files that have the same byte size may involve different amounts 
of work to process for any number of reasons.

Those can cause both inter-node load balance problems and intra-node load 
balance problems. This Jira is about fixing the intra-node load balance 
problem, so that the situation is no worse than before mt_dop.

The proposed solution is to have a queue of scan ranges per backend, sorted 
from largest to smallest, and have each instance pull scan ranges off that 
queue. The DiskIOMgr ReaderContext probably is already sufficient to solve this 
problem, and we'll need to add a different mechanism for Kudu, Hbase, etc.

  was:
Currently the scheduler statically divides scan ranges between fragment 
instances, Since IMPALA-9015 it statically load-balances scan ranges based on 
file size using the LPT algorithm in the schedule.

This has various pitfalls:
 * It interacts badly with dynamic partition pruning, which can filter
 * Different files that have the same byte size may involve different amounts 
of work to process for any number of reasons.

Those can cause both inter-node load balance problems and intra-node load 
balance problems. This Jira is about fixing the intra-node load balance 
problem, so that the situation is no worse than before mt_dop.

The proposed solution is to have a queue of scan ranges per backend, sorted 
from largest to smallest, and have each instance pull scan ranges off that 
queue. The DiskIOMgr ReaderContext probably is already sufficient to solve this 
problem, and we'll need to add a different mechanism for Kudu, Hbase, etc.


> Scan range load-balancing within backend
> 
>
> Key: IMPALA-9637
> URL: https://issues.apache.org/jira/browse/IMPALA-9637
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Distributed Exec
>Affects Versions: Impala 4.0
>Reporter: Tim Armstrong
>Priority: Major
>  Labels: multithreading, performance
>
> Currently the scheduler statically divides scan ranges between fragment 
> instances. Since IMPALA-9015 it statically load-balances scan ranges based on 
> file size using the LPT algorithm in the schedule.
> This has various pitfalls:
>  * It interacts badly with dynamic partition pruning, which can filter out a 
> bunch of scan ranges and unbalance the load
>  * Different files that have the same byte size may involve different amounts 
> of work to process for any number of reasons.
> Those can cause both inter-node load balance problems and intra-node load 
> balance problems. This Jira is about fixing the intra-node load balance 
> problem, so that the situation is no worse than before mt_dop.
> The proposed solution is to have a queue of scan ranges per backend, sorted 
> from largest to smallest, and have each instance pull scan ranges off that 
> queue. The DiskIOMgr ReaderContext probably is already sufficient to solve 
> this problem, and we'll need to add a different mechanism for Kudu, Hbase, 
> etc.
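
The static LPT balancing mentioned in the description can be illustrated
with a minimal sketch. It assumes scan ranges are represented only by their
byte size; AssignLpt and its parameters are hypothetical names for this
example and are not taken from the Impala scheduler. LPT (longest processing
time first) visits the items from largest to smallest and gives each one to
the currently least-loaded worker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: statically assigns scan ranges (given only as byte
// sizes) to num_instances fragment instances with the LPT heuristic.
// Returns, for each instance, the sizes of the ranges assigned to it.
std::vector<std::vector<int64_t>> AssignLpt(std::vector<int64_t> range_bytes,
                                            int num_instances) {
  assert(num_instances > 0);
  // Largest ranges first.
  std::sort(range_bytes.begin(), range_bytes.end(), std::greater<int64_t>());
  std::vector<std::vector<int64_t>> assignment(num_instances);
  std::vector<int64_t> load(num_instances, 0);
  for (int64_t bytes : range_bytes) {
    // Pick the instance with the fewest bytes so far (first one on ties).
    auto least = std::min_element(load.begin(), load.end());
    std::size_t idx = static_cast<std::size_t>(least - load.begin());
    assignment[idx].push_back(bytes);
    *least += bytes;
  }
  return assignment;
}
```

Because the assignment is fixed before execution, it cannot react when a
runtime filter later removes ranges from one instance or when two ranges of
equal size cost different amounts of work, which is the pitfall the issue
describes.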






[jira] [Assigned] (IMPALA-5314) Rename single letter tables in FE tests

2020-04-09 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong reassigned IMPALA-5314:
-

Assignee: Aayush Bhan  (was: abipc)

> Rename single letter tables in FE tests
> ---
>
> Key: IMPALA-5314
> URL: https://issues.apache.org/jira/browse/IMPALA-5314
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.9.0
>Reporter: Lars Volker
>Assignee: Aayush Bhan
>Priority: Minor
>  Labels: newbie
>
> I frequently create test tables on my local system with names like "t" or 
> "p". A couple of frontend tests use the same names and then fail with "Table 
> already exists".






[jira] [Commented] (IMPALA-5314) Rename single letter tables in FE tests

2020-04-09 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079996#comment-17079996
 ] 

Tim Armstrong commented on IMPALA-5314:
---

[~AayushBhan] I'll add you as a contributor in Jira and assign it to you. 
Thanks for the interest!

> Rename single letter tables in FE tests
> ---
>
> Key: IMPALA-5314
> URL: https://issues.apache.org/jira/browse/IMPALA-5314
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.9.0
>Reporter: Lars Volker
>Assignee: abipc
>Priority: Minor
>  Labels: newbie
>
> I frequently create test tables on my local system with names like "t" or 
> "p". A couple of frontend tests use the same names and then fail with "Table 
> already exists".






[jira] [Created] (IMPALA-9637) Scan range load-balancing within backend

2020-04-09 Thread Tim Armstrong (Jira)
Tim Armstrong created IMPALA-9637:
-

 Summary: Scan range load-balancing within backend
 Key: IMPALA-9637
 URL: https://issues.apache.org/jira/browse/IMPALA-9637
 Project: IMPALA
  Issue Type: Improvement
  Components: Distributed Exec
Affects Versions: Impala 4.0
Reporter: Tim Armstrong


Currently the scheduler statically divides scan ranges between fragment 
instances. Since IMPALA-9015 it statically load-balances scan ranges based on 
file size using the LPT algorithm in the schedule.

This has various pitfalls:
 * It interacts badly with dynamic partition pruning, which can filter
 * Different files that have the same byte size may involve different amounts 
of work to process for any number of reasons.

Those can cause both inter-node load balance problems and intra-node load 
balance problems. This Jira is about fixing the intra-node load balance 
problem, so that the situation is no worse than before mt_dop.

The proposed solution is to have a queue of scan ranges per backend, sorted 
from largest to smallest, and have each instance pull scan ranges off that 
queue. The DiskIOMgr ReaderContext probably is already sufficient to solve this 
problem, and we'll need to add a different mechanism for Kudu, Hbase, etc.
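
The proposed per-backend queue could look roughly like the following sketch.
It is an illustration under assumptions, not the DiskIOMgr mechanism: scan
ranges are reduced to their byte size, and SharedScanRangeQueue and Next are
hypothetical names. All fragment instances on a backend share one queue and
pull the largest remaining range whenever they become idle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical per-backend queue of scan ranges, sorted from largest to
// smallest. Instances call Next() whenever they need more work.
class SharedScanRangeQueue {
 public:
  explicit SharedScanRangeQueue(std::vector<int64_t> range_bytes)
      : ranges_(std::move(range_bytes)) {
    std::sort(ranges_.begin(), ranges_.end(), std::greater<int64_t>());
  }

  // Stores the largest remaining range in *out and returns true, or returns
  // false once the queue is drained. Safe to call from several threads.
  bool Next(int64_t* out) {
    assert(out != nullptr);
    std::lock_guard<std::mutex> l(mu_);
    if (next_ >= ranges_.size()) return false;
    *out = ranges_[next_++];
    return true;
  }

 private:
  std::mutex mu_;
  std::vector<int64_t> ranges_;
  std::size_t next_ = 0;  // Index of the next range to hand out.
};
```

With this shape, an instance that finishes early simply takes more ranges,
so imbalance within a backend is bounded by roughly one range's worth of
work.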






[jira] [Commented] (IMPALA-5314) Rename single letter tables in FE tests

2020-04-09 Thread Aayush Bhan (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079961#comment-17079961
 ] 

Aayush Bhan commented on IMPALA-5314:
-

Is anyone working on this? If not, I'd like to pick this up.

> Rename single letter tables in FE tests
> ---
>
> Key: IMPALA-5314
> URL: https://issues.apache.org/jira/browse/IMPALA-5314
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.9.0
>Reporter: Lars Volker
>Assignee: abipc
>Priority: Minor
>  Labels: newbie
>
> I frequently create test tables on my local system with names like "t" or 
> "p". A couple of frontend tests use the same names and then fail with "Table 
> already exists".






[jira] [Created] (IMPALA-9636) Retried queries that blacklist nodes should ensure they don't run on the blacklisted node

2020-04-09 Thread Sahil Takiar (Jira)
Sahil Takiar created IMPALA-9636:


 Summary: Retried queries that blacklist nodes should ensure they 
don't run on the blacklisted node
 Key: IMPALA-9636
 URL: https://issues.apache.org/jira/browse/IMPALA-9636
 Project: IMPALA
  Issue Type: Sub-task
Reporter: Sahil Takiar


When a query is retried due to a node blacklisting event, there is no guarantee 
that the retried query will *not* run on the blacklisted node. When a node is 
blacklisted, it is only placed on the blacklist for a certain period of time 
(the first time it is blacklisted I think it is only about 12 seconds). It is 
possible that retrying the query takes a while (perhaps the query has to wait 
in the admission control queue again). So it is possible that the retried query 
will end up running on the node that it blacklisted during its original 
attempt, which is probably unwise because that node caused the query to fail.
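
One possible way to give that guarantee is for the retried query to carry
its own exclusion set, independent of the time-limited cluster blacklist.
The sketch below only illustrates the idea; CandidateExecutors and its
parameters are hypothetical and do not correspond to existing Impala code.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the executors a retried query may be
// scheduled on. A node is skipped if it is on the cluster-wide blacklist
// right now, or if this query blacklisted it during an earlier attempt
// (even when the cluster-wide entry has already timed out).
std::vector<std::string> CandidateExecutors(
    const std::vector<std::string>& all_executors,
    const std::set<std::string>& cluster_blacklist,
    const std::set<std::string>& blacklisted_by_query) {
  std::vector<std::string> candidates;
  for (const std::string& executor : all_executors) {
    if (cluster_blacklist.count(executor) > 0) continue;
    if (blacklisted_by_query.count(executor) > 0) continue;
    candidates.push_back(executor);
  }
  return candidates;
}
```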






[jira] [Resolved] (IMPALA-8632) Add support for self-event detection for insert events

2020-04-09 Thread Xiaomeng Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Zhang resolved IMPALA-8632.

Fix Version/s: Impala 4.0
   Resolution: Fixed

> Add support for self-event detection for insert events
> --
>
> Key: IMPALA-8632
> URL: https://issues.apache.org/jira/browse/IMPALA-8632
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Vihang Karajgaonkar
>Assignee: Xiaomeng Zhang
>Priority: Critical
> Fix For: Impala 4.0
>
>
> In case of {{INSERT_EVENTS}} if Impala inserts into a table it causes a 
> refresh to the underlying table/partition. This could be unnecessary when 
> there is only one Impala cluster in the system. The existing self-event 
> detection framework cannot identify such events because they are not sending 
> HMS objects like tables and partitions to the HMS. Instead in case of 
> {{INSERT_EVENT}} HMS API only asks for a table name or partition value to 
> fire an insert event on it. 
> We can detect a self-event in such cases if the HMS API to fire a listener 
> event is improved to return the event id. This would be used by 
> EventProcessor to ignore the event when it is fetched later in the next 
> polling cycle. In order to support this, we will need to make a change to 
> Hive as well so that the enhanced API can be used.
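
The bookkeeping described above can be sketched as follows, assuming the
enhanced HMS call returns a numeric event id. SelfEventTracker and its
methods are hypothetical names for this illustration and are not the
EventProcessor API.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Hypothetical tracker for events this Impala instance fired itself.
class SelfEventTracker {
 public:
  // Called with the event id returned when the insert event was fired.
  void RecordSelfEvent(int64_t event_id) {
    assert(event_id >= 0);
    std::lock_guard<std::mutex> l(mu_);
    pending_.insert(event_id);
  }

  // Called for each event fetched in a polling cycle. Returns true if the
  // event was fired by this instance and can be ignored; the id is
  // forgotten so that the set does not grow without bound.
  bool ConsumeIfSelfEvent(int64_t event_id) {
    std::lock_guard<std::mutex> l(mu_);
    return pending_.erase(event_id) > 0;
  }

 private:
  std::mutex mu_;
  std::unordered_set<int64_t> pending_;
};
```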






[jira] [Commented] (IMPALA-9621) Support iceberg on hdfs

2020-04-09 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079594#comment-17079594
 ] 

Tim Armstrong commented on IMPALA-9621:
---

I think we definitely need to share a lot of code with the existing table 
implementations, i.e. I think it should use HdfsScanNode in the frontend and 
backend.

 

I don't know if it also makes sense to represent it as HdfsTable as far as 
metadata goes, because the partitioning scheme is very different. Some code 
would be shared, probably, but a lot of it might be very different.

> Support iceberg on hdfs
> ---
>
> Key: IMPALA-9621
> URL: https://issues.apache.org/jira/browse/IMPALA-9621
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: WangSheng
>Assignee: WangSheng
>Priority: Major
>
> We have been investigating Iceberg recently and are preparing to implement 
> selecting Iceberg data from Impala. Our production environment uses HDFS, so 
> we will try to support Iceberg on HDFS.






[jira] [Resolved] (IMPALA-9612) Runtime filter wait longer than it should be

2020-04-09 Thread Riza Suminto (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riza Suminto resolved IMPALA-9612.
--
Resolution: Fixed

The fix has been merged. Marking this Jira as resolved.

> Runtime filter wait longer than it should be
> 
>
> Key: IMPALA-9612
> URL: https://issues.apache.org/jira/browse/IMPALA-9612
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Riza Suminto
>Assignee: Riza Suminto
>Priority: Major
>
> In one of my query profiles, I found an info string like this:
>   Runtime filters: All filters arrived. Waited 59s783ms. Maximum arrival 
> delay: 15s296ms.
> If all runtime filters arrived within 15s296ms, then it should not wait until 
> 59s783ms to proceed.
> Looking at runtime-filter.cc, it looks like there is a potential race 
> condition in function RuntimeFilter::WaitForArrival().
> bool RuntimeFilter::WaitForArrival(int32_t timeout_ms) const {
>   unique_lock l(arrival_mutex_);
>   while (arrival_time_.Load() == 0) {
>     int64_t ms_since_registration = MonotonicMillis() - registration_time_;
>     int64_t ms_remaining = timeout_ms - ms_since_registration;
>     if (ms_remaining <= 0) break;
>     arrival_cv_.WaitFor(l, ms_remaining * MICROS_PER_MILLI);
>   }
>   return arrival_time_.Load() != 0;
> }
> Between checking arrival_time_.Load() and calling arrival_cv_.WaitFor(), 
> arrival_cv_ might be already signaled by either RuntimeFilter::SetFilter() or 
> RuntimeFilter::Cancel() because they do not acquire arrival_mutex_ first.






[jira] [Commented] (IMPALA-9635) Rewrite count(distinct) to use ds_hll_* functions when in BI mode

2020-04-09 Thread Gabor Kaszab (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079573#comment-17079573
 ] 

Gabor Kaszab commented on IMPALA-9635:
--

Most probably there is going to be a switch in Hive for the same purpose, and 
in my opinion it would make sense to use the same name for it, but that is not 
decided yet. And yes, if everything goes well, there will eventually be a set 
of approximate functions turned on by the same switch, so APPX_COUNT_DISTINCT 
probably won't be the best fit for this purpose.
On the other hand, if the HLL from DataSketches turns out to be much faster 
than our current HLL, we can consider replacing the old one with the new one.

> Rewrite count(distinct) to use ds_hll_* functions when in BI mode
> -
>
> Key: IMPALA-9635
> URL: https://issues.apache.org/jira/browse/IMPALA-9635
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Frontend
>Reporter: Gabor Kaszab
>Priority: Major
>
> This effort depends on major FE work that is currently in progress. The 
> task here is to verify that:
> 1) There is a switch to turn on BI mode
> 2) In BI mode count(distinct) queries are rewritten to ds_hll_* approx 
> functions






[jira] [Resolved] (IMPALA-8857) test_kudu_col_not_null_changed may fail because client reads older timestamp

2020-04-09 Thread Thomas Tauber-Marshall (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Tauber-Marshall resolved IMPALA-8857.

Fix Version/s: Impala 4.0
   Resolution: Fixed

> test_kudu_col_not_null_changed may fail because client reads older timestamp
> 
>
> Key: IMPALA-8857
> URL: https://issues.apache.org/jira/browse/IMPALA-8857
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 3.3.0
>Reporter: Tim Armstrong
>Assignee: Thomas Tauber-Marshall
>Priority: Critical
>  Labels: flaky
> Fix For: Impala 4.0
>
>
> {noformat}
> query_test/test_kudu.py:242: in test_kudu_col_not_null_changed
> assert len(cursor.fetchall()) == 100
> E   assert 61 == 100
> E+  where 61 = len([(0, None), (2, None), (4, None), (11, None), (12, 
> None), (19, None), ...])
> E+where [(0, None), (2, None), (4, None), (11, None), (12, None), 
> (19, None), ...] =  >()
> E+  where  > = 
> .fetchall
> {noformat}
> I believe this is a flaky test, since there's no attempt to pass the 
> timestamp from the kudu client that did the insert to the impala client 
> that's doing the reading.






[jira] [Commented] (IMPALA-9612) Runtime filter wait longer than it should be

2020-04-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079498#comment-17079498
 ] 

ASF subversion and git services commented on IMPALA-9612:
-

Commit 5e69ae1d7dc113bbcc8d7d75e3b1b5244e76f76a in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5e69ae1 ]

IMPALA-9612: Fix race condition in RuntimeFilter::WaitForArrival

In function RuntimeFilter::WaitForArrival, there is a race condition
where condition variable arrival_cv_ may be signaled right after the
thread gets into the loop and before it calls arrival_cv_.WaitFor().
This can cause the runtime filter to wait the entire
RUNTIME_FILTER_WAIT_TIME_MS even though the filter has arrived or
been canceled earlier than that. This commit avoids the race condition by
making RuntimeFilter::SetFilter and RuntimeFilter::Cancel acquire
arrival_mutex_ before checking the value of arrival_time_ and
release arrival_mutex_ before signaling arrival_cv_.

Testing:
- Add new be test runtime-filter-test.cc
- Pass core tests.

Change-Id: I7dffa626103ef0af06ad1e89231b0d2ee54bb94a
Reviewed-on: http://gerrit.cloudera.org:8080/15673
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Runtime filter wait longer than it should be
> 
>
> Key: IMPALA-9612
> URL: https://issues.apache.org/jira/browse/IMPALA-9612
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Riza Suminto
>Assignee: Riza Suminto
>Priority: Major
>
> In one of my query profiles, I found an info string like this:
>   Runtime filters: All filters arrived. Waited 59s783ms. Maximum arrival 
> delay: 15s296ms.
> If all runtime filters arrived within 15s296ms, then it should not wait until 
> 59s783ms to proceed.
> Looking at runtime-filter.cc, it looks like there is a potential race 
> condition in function RuntimeFilter::WaitForArrival().
> bool RuntimeFilter::WaitForArrival(int32_t timeout_ms) const {
>   unique_lock l(arrival_mutex_);
>   while (arrival_time_.Load() == 0) {
>     int64_t ms_since_registration = MonotonicMillis() - registration_time_;
>     int64_t ms_remaining = timeout_ms - ms_since_registration;
>     if (ms_remaining <= 0) break;
>     arrival_cv_.WaitFor(l, ms_remaining * MICROS_PER_MILLI);
>   }
>   return arrival_time_.Load() != 0;
> }
> Between checking arrival_time_.Load() and calling arrival_cv_.WaitFor(), 
> arrival_cv_ might be already signaled by either RuntimeFilter::SetFilter() or 
> RuntimeFilter::Cancel() because they do not acquire arrival_mutex_ first.
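
The locking pattern described in the commit message can be sketched with
standard C++ primitives. This is an illustration only: ArrivalSignal,
SetArrived and the use of std::condition_variable are assumptions made for
the example, and the timeout here is measured from the call rather than from
filter registration as in the quoted Impala code.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the arrival state of a runtime filter.
class ArrivalSignal {
 public:
  // Publishes the arrival time. Taking the mutex before the store means the
  // store cannot land between a waiter's check and its wait, so the
  // notification cannot be lost. The mutex is released before notifying.
  void SetArrived(int64_t now_ms) {
    assert(now_ms != 0);  // 0 is reserved for "not arrived yet".
    {
      std::lock_guard<std::mutex> l(mu_);
      arrival_time_ms_.store(now_ms);
    }
    cv_.notify_all();
  }

  // Waits until the arrival time is set or the timeout expires.
  // Returns true if the filter arrived.
  bool WaitForArrival(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(mu_);
    return cv_.wait_for(l, timeout,
                        [this] { return arrival_time_ms_.load() != 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::atomic<int64_t> arrival_time_ms_{0};
};
```

Without the lock in SetArrived, a waiter could observe a zero arrival time,
miss the notification, and then sleep for the full timeout even though the
value was set in the meantime, which matches the 59s wait reported above.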






[jira] [Commented] (IMPALA-9618) Usability issues with dev env setup.

2020-04-09 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079499#comment-17079499
 ] 

ASF subversion and git services commented on IMPALA-9618:
-

Commit 5989900ae81a98d6977bdd60f2281da47e9f69b7 in impala's branch 
refs/heads/master from Tim Armstrong
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5989900 ]

IMPALA-9618: fix some usability issues with dev env

Automatically assume IMPALA_HOME is the source directory
in a couple of places.

Delete the cache_tables.py script and MINI_DFS_BASE_DATA_DIR
config var which had both bit-rotted and were unused.

Allow setting IMPALA_CLUSTER_NODES_DIR to put the minicluster
nodes, most importantly the data, in a different location, e.g.
on a different filesystem.

Testing:
I set up a dev environment using this code and was able to
load data and run some tests.

Change-Id: Ibd8b42a6d045d73e3ea29015aa6ccbbde278eec7
Reviewed-on: http://gerrit.cloudera.org:8080/15687
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Usability issues with dev env setup.
> 
>
> Key: IMPALA-9618
> URL: https://issues.apache.org/jira/browse/IMPALA-9618
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
>
> bootstrap_system.sh should be able to auto-detect IMPALA_HOME. The only 
> reasonable default for IMPALA_HOME is the repository that the script is being 
> run from. We should detect that, instead of defaulting to ~/Impala, which 
> results in a weird error if you checked out Impala somewhere else.
> We should be able to put the minicluster data directory in a different place 
> from the source checkout, e.g. if you want to put it on a different disk.
>  
>  
>  






[jira] [Resolved] (IMPALA-9618) Usability issues with dev env setup.

2020-04-09 Thread Tim Armstrong (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong resolved IMPALA-9618.
---
Fix Version/s: Impala 4.0
   Resolution: Fixed

> Usability issues with dev env setup.
> 
>
> Key: IMPALA-9618
> URL: https://issues.apache.org/jira/browse/IMPALA-9618
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Reporter: Tim Armstrong
>Assignee: Tim Armstrong
>Priority: Major
> Fix For: Impala 4.0
>
>
> bootstrap_system.sh should be able to auto-detect IMPALA_HOME. The only 
> reasonable default for IMPALA_HOME is the repository that the script is being 
> run from. We should detect that, instead of defaulting to ~/Impala, which 
> results in a weird error if you checked out Impala somewhere else.
> We should be able to put the minicluster data directory in a different place 
> from the source checkout, e.g. if you want to put it on a different disk.
>  
>  
>  






[jira] [Commented] (IMPALA-9635) Rewrite count(distinct) to use ds_hll_* functions when in BI mode

2020-04-09 Thread Tim Armstrong (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079488#comment-17079488
 ] 

Tim Armstrong commented on IMPALA-9635:
---

Are you thinking of changing appx_count_distinct to use this different 
implementation? 
[https://impala.apache.org/docs/build/html/topics/impala_appx_count_distinct.html]
Will this be a new flag that enables a bigger set of approximation functions?

> Rewrite count(distinct) to use ds_hll_* functions when in BI mode
> -
>
> Key: IMPALA-9635
> URL: https://issues.apache.org/jira/browse/IMPALA-9635
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Frontend
>Reporter: Gabor Kaszab
>Priority: Major
>
> This effort depends on major FE work that is currently in progress. The 
> task here is to verify that:
> 1) There is a switch to turn on BI mode
> 2) In BI mode count(distinct) queries are rewritten to ds_hll_* approx 
> functions






[jira] [Updated] (IMPALA-9615) Make re2's max_mem option configurable via an Impala startup flag.

2020-04-09 Thread Andrew Sherman (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Sherman updated IMPALA-9615:
---
Labels: backend ramp-up  (was: ramp-up)

> Make re2's max_mem option configurable via an Impala startup flag.
> --
>
> Key: IMPALA-9615
> URL: https://issues.apache.org/jira/browse/IMPALA-9615
> Project: IMPALA
>  Issue Type: Improvement
>Affects Versions: Impala 3.4.0
>Reporter: Attila Jeges
>Priority: Major
>  Labels: backend, ramp-up
>
> Right now Impala always uses the default max_mem value for re2 regexp pattern 
> matching.
> For more memory-consuming patterns this can cause the following error:
> "re2/re2.cc:667: DFA out of memory: size x, bytemap range xx, list count 
> x".
> It would be nice if re2's max_mem option were configurable via an Impala 
> startup flag.






[jira] [Updated] (IMPALA-9615) Make re2's max_mem option configurable via an Impala startup flag.

2020-04-09 Thread Andrew Sherman (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Sherman updated IMPALA-9615:
---
Labels: ramp-up  (was: )

> Make re2's max_mem option configurable via an Impala startup flag.
> --
>
> Key: IMPALA-9615
> URL: https://issues.apache.org/jira/browse/IMPALA-9615
> Project: IMPALA
>  Issue Type: Improvement
>Affects Versions: Impala 3.4.0
>Reporter: Attila Jeges
>Priority: Major
>  Labels: ramp-up
>
> Right now Impala always uses the default max_mem value for re2 regexp pattern 
> matching.
> For more memory-consuming patterns this can cause the following error:
> "re2/re2.cc:667: DFA out of memory: size x, bytemap range xx, list count 
> x".
> It would be nice if re2's max_mem option were configurable via an Impala 
> startup flag.






[jira] [Work started] (IMPALA-9593) Implement count(distinct) function (DataSketches/HLL)

2020-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-9593 started by Gabor Kaszab.

> Implement count(distinct) function (DataSketches/HLL)
> -
>
> Key: IMPALA-9593
> URL: https://issues.apache.org/jira/browse/IMPALA-9593
> Project: IMPALA
>  Issue Type: Epic
>  Components: Backend
>Reporter: Boglarka Egyed
>Assignee: Gabor Kaszab
>Priority: Major
>
> Implement the count(distinct) function from the DataSketches library for HLL 
> in C++.
> General info about the sketch:
> http://datasketches.apache.org/docs/HLL/HLL.html
> C++ implementation to wrap:
> https://github.com/apache/incubator-datasketches-cpp/tree/master/hll






[jira] [Created] (IMPALA-9635) Rewrite count(distinct) to use ds_hll_* functions when in BI mode

2020-04-09 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-9635:


 Summary: Rewrite count(distinct) to use ds_hll_* functions when in 
BI mode
 Key: IMPALA-9635
 URL: https://issues.apache.org/jira/browse/IMPALA-9635
 Project: IMPALA
  Issue Type: New Feature
  Components: Frontend
Reporter: Gabor Kaszab


This effort depends on major FE work that is currently in progress. The 
task here is to verify that:
1) There is a switch to turn on BI mode
2) In BI mode count(distinct) queries are rewritten to ds_hll_* approx functions
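
The rewrite in point 2 can be pictured with a small text-level model. This is only a hedged sketch in Python, not how the Impala frontend works: a real rewrite would operate on the analyzed query rather than on SQL text, and the `bi_mode` parameter, the function name `rewrite_count_distinct` and the regular expression are all hypothetical names introduced here for illustration. It handles only a single, simple column reference and leaves everything else untouched.

```python
import re

# Hypothetical pattern: count(distinct <one column reference>), case-insensitive.
# Multi-argument forms such as count(distinct a, b) deliberately do not match.
_COUNT_DISTINCT = re.compile(
    r"\bcount\s*\(\s*distinct\s+([\w.]+)\s*\)", re.IGNORECASE
)


def rewrite_count_distinct(sql, bi_mode):
    """Toy model: swap count(distinct x) for the approximate ds_hll_* form."""
    if not bi_mode:
        # The switch is off, so the query text is returned unchanged.
        return sql
    return _COUNT_DISTINCT.sub(r"ds_hll_estimate(ds_hll_sketch(\1))", sql)


print(rewrite_count_distinct("select count(distinct id) from t", bi_mode=True))
print(rewrite_count_distinct("select count(distinct id) from t", bi_mode=False))
```

Both checks listed above map onto this model: the switch is the `bi_mode` argument, and the rewritten text is what a test would compare against.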






[jira] [Work started] (IMPALA-9632) Implement ds_hll_sketch() and ds_hll_estimate() functions

2020-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-9632 started by Gabor Kaszab.

> Implement ds_hll_sketch() and ds_hll_estimate() functions
> -
>
> Key: IMPALA-9632
> URL: https://issues.apache.org/jira/browse/IMPALA-9632
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>
> These should be built-in functions that use DataSketches functionality that 
> was integrated by IMPALA-9631.
> ds_hll_sketch() should receive a primitive expression and return a sketch.
>  ds_hll_estimate() should receive a sketch and return a primitive that is the 
> cardinality estimate for that set of data provided to the sketch.
> Usage:
>  select ds_hll_estimate(ds_hll_sketch(col_name)) from table_name;
>  Returns a cardinality estimate (similarly to ndv() ) for that particular 
> column.
> Hive change that introduced the same: 
> https://issues.apache.org/jira/browse/HIVE-22940
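
The contract described above (build a sketch from column values, then read an estimate from the sketch) can be illustrated with a toy HyperLogLog model. This is a minimal Python sketch under stated assumptions, not the DataSketches algorithm and not Impala's implementation: the register count (`LG_K = 12`), the SHA-1 based hash and the one-byte-per-register serialization are choices made purely for illustration, and the Python functions only borrow the names ds_hll_sketch and ds_hll_estimate.

```python
import hashlib
import math

LG_K = 12          # assumed: log2 of the number of registers
K = 1 << LG_K      # 4096 registers


def ds_hll_sketch(values):
    """Toy stand-in for ds_hll_sketch(): fold values into a binary sketch."""
    registers = bytearray(K)
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash of the value
        index = h >> (64 - LG_K)                     # top LG_K bits pick a register
        rest = h & ((1 << (64 - LG_K)) - 1)          # remaining 52 bits
        rank = (64 - LG_K) - rest.bit_length() + 1   # 1-based position of first 1 bit
        if rank > registers[index]:
            registers[index] = rank
    return bytes(registers)


def ds_hll_estimate(sketch):
    """Toy stand-in for ds_hll_estimate(): cardinality estimate from a sketch."""
    zeros = sketch.count(0)
    alpha = 0.7213 / (1 + 1.079 / K)
    raw = alpha * K * K / sum(2.0 ** -r for r in sketch)
    if raw <= 2.5 * K and zeros > 0:
        # Small cardinalities: fall back to linear counting on empty registers.
        return K * math.log(K / zeros)
    return raw


# A column with 1000 distinct values, each repeated three times.
col_name = [i % 1000 for i in range(3000)]
print("exact distinct:", len(set(col_name)))
print("estimate:", ds_hll_estimate(ds_hll_sketch(col_name)))
```

Repeated values do not change the sketch, which is why the estimate tracks the number of distinct values rather than the row count.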






[jira] [Assigned] (IMPALA-9632) Implement ds_hll_sketch() and ds_hll_estimate() functions

2020-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab reassigned IMPALA-9632:


Assignee: Gabor Kaszab

> Implement ds_hll_sketch() and ds_hll_estimate() functions
> -
>
> Key: IMPALA-9632
> URL: https://issues.apache.org/jira/browse/IMPALA-9632
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Assignee: Gabor Kaszab
>Priority: Major
>
> These should be built-in functions that use DataSketches functionality that 
> was integrated by IMPALA-9631.
> ds_hll_sketch() should receive a primitive expression and return a sketch.
>  ds_hll_estimate() should receive a sketch and return a primitive that is the 
> cardinality estimate for that set of data provided to the sketch.
> Usage:
>  select ds_hll_estimate(ds_hll_sketch(col_name)) from table_name;
>  Returns a cardinality estimate (similarly to ndv() ) for that particular 
> column.
> Hive change that introduced the same: 
> https://issues.apache.org/jira/browse/HIVE-22940






[jira] [Updated] (IMPALA-9633) Implement ds_hll_union() builtin function

2020-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-9633:
-
Description: 
ds_hll_union() is an aggregating function that accepts sketches and produces a 
single sketch that is the combination of the received sketches.

Example from Hive:
{code:java}
create temporary table sketch_intermediate (category char(1), sketch binary);
insert into sketch_intermediate
  select category, ds_hll_sketch(id) from sketch_input group by category;
select ds_hll_estimate(ds_hll_union(sketch)) from sketch_intermediate;
{code}
Some test data for the example:
{code:java}
create temporary table sketch_input (id int, category char(1));
insert into table sketch_input values
  (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'),
  (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
  (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'),
  (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b');
{code}
Approximate result:
{code:java}
15.00521540663
{code}

Hive change that introduced the same: 
https://issues.apache.org/jira/browse/HIVE-22940

  was:
ds_hll_union() is an aggregating function that accepts sketches and produces a 
single sketch that is the combination of the received sketches.

Example from Hive:
{code:java}
create temporary table sketch_intermediate (category char(1), sketch binary);
insert into sketch_intermediate
  select category, ds_hll_sketch(id) from sketch_input group by category;
select ds_hll_estimate(ds_hll_union(sketch)) from sketch_intermediate;
{code}

Some test data for the example:
{code:java}
create temporary table sketch_input (id int, category char(1));
insert into table sketch_input values
  (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'),
  (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
  (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'),
  (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b');
{code}

Approximate result:
{code:java}
15.00521540663
{code}



> Implement ds_hll_union() builtin function
> -
>
> Key: IMPALA-9633
> URL: https://issues.apache.org/jira/browse/IMPALA-9633
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Priority: Major
>
> ds_hll_union() is an aggregating function that accepts sketches and produces 
> a single sketch that is the combination of the received sketches.
> Example from Hive:
> {code:java}
> create temporary table sketch_intermediate (category char(1), sketch binary);
> insert into sketch_intermediate
>   select category, ds_hll_sketch(id) from sketch_input group by category;
> select ds_hll_estimate(ds_hll_union(sketch)) from sketch_intermediate;
> {code}
> Some test data for the example:
> {code:java}
> create temporary table sketch_input (id int, category char(1));
> insert into table sketch_input values
>   (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'),
>   (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
>   (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'),
>   (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b');
> {code}
> Approximate result:
> {code:java}
> 15.00521540663
> {code}
> Hive change that introduced the same: 
> https://issues.apache.org/jira/browse/HIVE-22940
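
Why a union of per-category sketches estimates the overall distinct count can be shown with a toy HyperLogLog model in which merging is a register-wise maximum. This is a hedged Python illustration, not the DataSketches or Hive implementation: the register count, the SHA-1 based hash and the byte layout are assumptions, and the functions only borrow the ds_hll_* names. It replays the test data from the example, so its output should land near 15 but is not expected to reproduce the exact Hive figure.

```python
import hashlib
import math

LG_K = 12          # assumed: log2 of the number of registers
K = 1 << LG_K


def ds_hll_sketch(values):
    """Toy model: build one binary sketch from a group of values."""
    registers = bytearray(K)
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - LG_K)
        rest = h & ((1 << (64 - LG_K)) - 1)
        rank = (64 - LG_K) - rest.bit_length() + 1
        if rank > registers[index]:
            registers[index] = rank
    return bytes(registers)


def ds_hll_union(sketches):
    """Toy model: merge sketches by taking the maximum of each register."""
    merged = bytes(K)
    for sketch in sketches:
        merged = bytes(max(a, b) for a, b in zip(merged, sketch))
    return merged


def ds_hll_estimate(sketch):
    """Toy model: cardinality estimate with linear counting for small sets."""
    zeros = sketch.count(0)
    alpha = 0.7213 / (1 + 1.079 / K)
    raw = alpha * K * K / sum(2.0 ** -r for r in sketch)
    if raw <= 2.5 * K and zeros > 0:
        return K * math.log(K / zeros)
    return raw


# Test data from the example: ids 1..10 in category 'a', ids 6..15 in 'b'.
sketch_input = [(i, "a") for i in range(1, 11)] + [(i, "b") for i in range(6, 16)]

# Equivalent of: select category, ds_hll_sketch(id) ... group by category
groups = {}
for id_, category in sketch_input:
    groups.setdefault(category, []).append(id_)
sketch_intermediate = {c: ds_hll_sketch(ids) for c, ids in groups.items()}

union_estimate = ds_hll_estimate(ds_hll_union(sketch_intermediate.values()))
print(union_estimate)
```

In this model the merged sketch is byte-for-byte identical to a sketch built directly from all 15 ids, which is the property that makes per-group sketches reusable.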






[jira] [Updated] (IMPALA-9632) Implement ds_hll_sketch() and ds_hll_estimate() functions

2020-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-9632:
-
Description: 
These should be built-in functions that use DataSketches functionality that was 
integrated by IMPALA-9631.

ds_hll_sketch() should receive a primitive expression and return a sketch.
 ds_hll_estimate() should receive a sketch and return a primitive that is the 
cardinality estimate for that set of data provided to the sketch.

Usage:
 select ds_hll_estimate(ds_hll_sketch(col_name)) from table_name;
 Returns a cardinality estimate (similarly to ndv() ) for that particular 
column.

Hive change that introduced the same: 
https://issues.apache.org/jira/browse/HIVE-22940

  was:
These should be built-in functions that use DataSketches functionality that was 
integrated by [IMPALA-9631|https://issues.apache.org/jira/browse/IMPALA-9631].

ds_hll_sketch() should receive a primitive expression and return a sketch.
ds_hll_estimate() should receive a sketch and return a primitive that is the 
cardinality estimate for that set of data provided to the sketch.

Usage:
select ds_hll_estimate(ds_hll_sketch(col_name)) from table_name;
Returns a cardinality estimate (similarly to ndv() ) for that particular column.


> Implement ds_hll_sketch() and ds_hll_estimate() functions
> -
>
> Key: IMPALA-9632
> URL: https://issues.apache.org/jira/browse/IMPALA-9632
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Reporter: Gabor Kaszab
>Priority: Major
>
> These should be built-in functions that use DataSketches functionality that 
> was integrated by IMPALA-9631.
> ds_hll_sketch() should receive a primitive expression and return a sketch.
>  ds_hll_estimate() should receive a sketch and return a primitive that is the 
> cardinality estimate for that set of data provided to the sketch.
> Usage:
>  select ds_hll_estimate(ds_hll_sketch(col_name)) from table_name;
>  Returns a cardinality estimate (similarly to ndv() ) for that particular 
> column.
> Hive change that introduced the same: 
> https://issues.apache.org/jira/browse/HIVE-22940






[jira] [Created] (IMPALA-9634) Performance comparison between ndv() and ds_hll_* functions

2020-04-09 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-9634:


 Summary: Performance comparison between ndv() and ds_hll_* 
functions
 Key: IMPALA-9634
 URL: https://issues.apache.org/jira/browse/IMPALA-9634
 Project: IMPALA
  Issue Type: New Feature
  Components: Perf Investigation
Reporter: Gabor Kaszab









[jira] [Created] (IMPALA-9633) Implement ds_hll_union() builtin function

2020-04-09 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-9633:


 Summary: Implement ds_hll_union() builtin function
 Key: IMPALA-9633
 URL: https://issues.apache.org/jira/browse/IMPALA-9633
 Project: IMPALA
  Issue Type: New Feature
  Components: Backend, Frontend
Reporter: Gabor Kaszab


ds_hll_union() is an aggregating function that accepts sketches and produces a 
single sketch that is the combination of the received sketches.

Example from Hive:
{code:java}
create temporary table sketch_intermediate (category char(1), sketch binary);
insert into sketch_intermediate
  select category, ds_hll_sketch(id) from sketch_input group by category;
select ds_hll_estimate(ds_hll_union(sketch)) from sketch_intermediate;
{code}

Some test data for the example:
{code:java}
create temporary table sketch_input (id int, category char(1));
insert into table sketch_input values
  (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'),
  (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
  (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'),
  (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b');
{code}

Approximate result:
{code:java}
15.00521540663
{code}







[jira] [Commented] (IMPALA-9629) Extend bootstrap_system.sh to support CentOS 8

2020-04-09 Thread Laszlo Gaal (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079304#comment-17079304
 ] 

Laszlo Gaal commented on IMPALA-9629:
-

Review: https://gerrit.cloudera.org/c/15623/10

> Extend bootstrap_system.sh to support CentOS 8
> --
>
> Key: IMPALA-9629
> URL: https://issues.apache.org/jira/browse/IMPALA-9629
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Laszlo Gaal
>Assignee: Laszlo Gaal
>Priority: Major
>
> Red Hat / CentOS 8 is the new generation of the most popular enterprise Linux 
> distribution. Impala should support development on this platform as well.
> Extending the system setup logic in bootstrap_system.sh is the first step for 
> this support.






[jira] [Commented] (IMPALA-9626) Use Python 2.7 from toolchain

2020-04-09 Thread Laszlo Gaal (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079305#comment-17079305
 ] 

Laszlo Gaal commented on IMPALA-9626:
-

Review: https://gerrit.cloudera.org/c/15624/13

> Use Python 2.7 from toolchain
> -
>
> Key: IMPALA-9626
> URL: https://issues.apache.org/jira/browse/IMPALA-9626
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Infrastructure
>Reporter: Tim Armstrong
>Assignee: Laszlo Gaal
>Priority: Major
>







[jira] [Work started] (IMPALA-9629) Extend bootstrap_system.sh to support CentOS 8

2020-04-09 Thread Laszlo Gaal (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-9629 started by Laszlo Gaal.
---
> Extend bootstrap_system.sh to support CentOS 8
> --
>
> Key: IMPALA-9629
> URL: https://issues.apache.org/jira/browse/IMPALA-9629
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Laszlo Gaal
>Assignee: Laszlo Gaal
>Priority: Major
>
> Red Hat / CentOS 8 is the new generation of the most popular enterprise Linux 
> distribution. Impala should support development on this platform as well.
> Extending the system setup logic in bootstrap_system.sh is the first step for 
> this support.






[jira] [Created] (IMPALA-9632) Implement ds_hll_sketch() and ds_hll_estimate() functions

2020-04-09 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-9632:


 Summary: Implement ds_hll_sketch() and ds_hll_estimate() functions
 Key: IMPALA-9632
 URL: https://issues.apache.org/jira/browse/IMPALA-9632
 Project: IMPALA
  Issue Type: New Feature
  Components: Backend, Frontend
Reporter: Gabor Kaszab


These should be built-in functions that use DataSketches functionality that was 
integrated by [IMPALA-9631|https://issues.apache.org/jira/browse/IMPALA-9631].

ds_hll_sketch() should receive a primitive expression and return a sketch.
ds_hll_estimate() should receive a sketch and return a primitive that is the 
cardinality estimate for that set of data provided to the sketch.

Usage:
select ds_hll_estimate(ds_hll_sketch(col_name)) from table_name;
Returns a cardinality estimate (similarly to ndv() ) for that particular column.






[jira] [Created] (IMPALA-9631) Import HLL functionality from DataSketches

2020-04-09 Thread Gabor Kaszab (Jira)
Gabor Kaszab created IMPALA-9631:


 Summary: Import HLL functionality from DataSketches
 Key: IMPALA-9631
 URL: https://issues.apache.org/jira/browse/IMPALA-9631
 Project: IMPALA
  Issue Type: New Feature
  Components: Infrastructure
Reporter: Gabor Kaszab


Import HLL from DataSketches into Impala:
https://github.com/apache/incubator-datasketches-cpp/tree/master/hll/include

Add it to the build system so that hll.hpp can be included in Impala source 
files, and verify with tests that basic functionality such as sketch creation, 
update() and get_result() can be invoked.






[jira] [Updated] (IMPALA-9593) Implement count(distinct) function (DataSketches/HLL)

2020-04-09 Thread Gabor Kaszab (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab updated IMPALA-9593:
-
Issue Type: Epic  (was: New Feature)

> Implement count(distinct) function (DataSketches/HLL)
> -
>
> Key: IMPALA-9593
> URL: https://issues.apache.org/jira/browse/IMPALA-9593
> Project: IMPALA
>  Issue Type: Epic
>  Components: Backend
>Reporter: Boglarka Egyed
>Assignee: Gabor Kaszab
>Priority: Major
>
> Implement the count(distinct) function from the DataSketches library for HLL 
> in C++.
> General info about the sketch:
> http://datasketches.apache.org/docs/HLL/HLL.html
> C++ implementation to wrap:
> https://github.com/apache/incubator-datasketches-cpp/tree/master/hll






[jira] [Created] (IMPALA-9630) Keep blocking queue cache line aligned on aarch64

2020-04-09 Thread zhaorenhai (Jira)
zhaorenhai created IMPALA-9630:
--

 Summary: Keep blocking queue cache line aligned on aarch64
 Key: IMPALA-9630
 URL: https://issues.apache.org/jira/browse/IMPALA-9630
 Project: IMPALA
  Issue Type: Sub-task
Reporter: zhaorenhai
Assignee: zhaorenhai









[jira] [Commented] (IMPALA-9621) Support iceberg on hdfs

2020-04-09 Thread WangSheng (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079038#comment-17079038
 ] 

WangSheng commented on IMPALA-9621:
---

[~tarmstrong] Hi Tim, here is the quick start for the Iceberg API: 
[create-table|https://iceberg.apache.org/api-quickstart/#create-a-table]. 
I've already read the Iceberg source code: when HiveCatalog is used to create 
a table, Iceberg calls HiveMetaStoreClient to create the table in HMS. You can 
find the code in 
[HiveTableOperations.doCommit()|https://github.com/apache/incubator-iceberg/blob/master/hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java].
 I will test this in my local environment later, and will also try to stay 
consistent if possible.

> Support iceberg on hdfs
> ---
>
> Key: IMPALA-9621
> URL: https://issues.apache.org/jira/browse/IMPALA-9621
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: WangSheng
>Assignee: WangSheng
>Priority: Major
>
> We have been investigating Iceberg recently and are preparing to implement 
> selecting Iceberg data from Impala. Our production environment uses HDFS, so 
> we will try to support Iceberg on HDFS.






[jira] [Commented] (IMPALA-9621) Support iceberg on hdfs

2020-04-09 Thread WangSheng (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-9621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079031#comment-17079031
 ] 

WangSheng commented on IMPALA-9621:
---

[~stakiar] Thanks for your suggestion, Sahil. I read the code in IMPALA-8778 
several days ago. That patch lets Impala read Hudi optimized tables by 
treating HUDI_PARQUET as a special kind of Parquet. When handling 
HUDI_PARQUET, Impala just filters and then treats the data as normal Parquet. 
My opinion is that we should treat Iceberg as a new data source, like Kudu or 
HBase, so that we can create/drop/alter/select Iceberg tables from Impala.

> Support iceberg on hdfs
> ---
>
> Key: IMPALA-9621
> URL: https://issues.apache.org/jira/browse/IMPALA-9621
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: WangSheng
>Assignee: WangSheng
>Priority: Major
>
> We have been investigating Iceberg recently and are preparing to implement 
> selecting Iceberg data from Impala. Our production environment uses HDFS, so 
> we will try to support Iceberg on HDFS.


