[jira] [Updated] (IMPALA-7027) Multiple Cast to Varchar with different limit fails with "AnalysisException: null CAUSED BY: IllegalArgumentException: "

2018-05-14 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm updated IMPALA-7027:
---
Description: 
If an Impala query contains multiple casts of '' to VARCHAR together with a 
DISTINCT, as below, the query fails when a cast to a lower VARCHAR limit 
follows a cast to a higher one.

 

Query 1> Fails with " AnalysisException: null CAUSED BY: 
IllegalArgumentException: targetType=VARCHAR(100) type=VARCHAR(101)"

SELECT DISTINCT CAST('' as VARCHAR(101)) as CL_COMMENTS,CAST('' as 
VARCHAR(100))  as CL_USER_ID FROM tablename limit 1

Whereas the query below succeeds.

Query 2> Success

 SELECT DISTINCT CAST('' as VARCHAR(100)) as CL_COMMENTS,CAST('' as 
VARCHAR(101))  as CL_USER_ID FROM  tablename limit 1

*Workaround*
SET ENABLE_EXPR_REWRITES=false;

  was:
If an Impala query contains multiple casts of '' to VARCHAR together with a 
DISTINCT, as below, the query fails when a cast to a lower VARCHAR limit 
follows a cast to a higher one.

 

Query 1> Fails with " AnalysisException: null CAUSED BY: 
IllegalArgumentException: targetType=VARCHAR(100) type=VARCHAR(101)"

SELECT DISTINCT CAST('' as VARCHAR(101)) as CL_COMMENTS,CAST('' as 
VARCHAR(100))  as CL_USER_ID FROM tablename limit 1

Whereas the query below succeeds.

Query 2> Success

 SELECT DISTINCT CAST('' as VARCHAR(100)) as CL_COMMENTS,CAST('' as 
VARCHAR(101))  as CL_USER_ID FROM  tablename limit 1

The query succeeds if we remove the DISTINCT clause or set 
enable_expr_rewrites=false;


> Multiple Cast to Varchar with different limit fails with "AnalysisException: 
> null CAUSED BY: IllegalArgumentException: "
> 
>
> Key: IMPALA-7027
> URL: https://issues.apache.org/jira/browse/IMPALA-7027
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.0, Impala 2.12.0
>Reporter: Meenakshi
>Priority: Critical
>  Labels: planner, regression
>
> If an Impala query contains multiple casts of '' to VARCHAR together with a 
> DISTINCT, as below, the query fails when a cast to a lower VARCHAR limit 
> follows a cast to a higher one.
>  
> Query 1> Fails with " AnalysisException: null CAUSED BY: 
> IllegalArgumentException: targetType=VARCHAR(100) type=VARCHAR(101)"
> SELECT DISTINCT CAST('' as VARCHAR(101)) as CL_COMMENTS,CAST('' as 
> VARCHAR(100))  as CL_USER_ID FROM tablename limit 1
> Whereas the query below succeeds.
> Query 2> Success
>  SELECT DISTINCT CAST('' as VARCHAR(100)) as CL_COMMENTS,CAST('' as 
> VARCHAR(101))  as CL_USER_ID FROM  tablename limit 1
> *Workaround*
> SET ENABLE_EXPR_REWRITES=false;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-7027) Multiple Cast to Varchar with different limit fails with "AnalysisException: null CAUSED BY: IllegalArgumentException: "

2018-05-14 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-7027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm updated IMPALA-7027:
---
Affects Version/s: (was: Impala 2.9.0)
   Impala 3.0
   Impala 2.12.0
   Labels: planner regression  (was: )

> Multiple Cast to Varchar with different limit fails with "AnalysisException: 
> null CAUSED BY: IllegalArgumentException: "
> 
>
> Key: IMPALA-7027
> URL: https://issues.apache.org/jira/browse/IMPALA-7027
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.0, Impala 2.12.0
>Reporter: Meenakshi
>Priority: Critical
>  Labels: planner, regression
>
> If an Impala query contains multiple casts of '' to VARCHAR together with a 
> DISTINCT, as below, the query fails when a cast to a lower VARCHAR limit 
> follows a cast to a higher one.
>  
> Query 1> Fails with " AnalysisException: null CAUSED BY: 
> IllegalArgumentException: targetType=VARCHAR(100) type=VARCHAR(101)"
> SELECT DISTINCT CAST('' as VARCHAR(101)) as CL_COMMENTS,CAST('' as 
> VARCHAR(100))  as CL_USER_ID FROM tablename limit 1
> Whereas the query below succeeds.
> Query 2> Success
>  SELECT DISTINCT CAST('' as VARCHAR(100)) as CL_COMMENTS,CAST('' as 
> VARCHAR(101))  as CL_USER_ID FROM  tablename limit 1
> The query succeeds if we remove the DISTINCT clause or set 
> enable_expr_rewrites=false;






[jira] [Resolved] (IMPALA-6617) Preconditions.checkState(val.getColValsSize() == 1); in EvalExprWithoutRow()

2018-05-09 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6617.

Resolution: Cannot Reproduce

Haven't seen this in a long time; my guess is that it was caused by a memory 
corruption issue that has since been fixed.

> Preconditions.checkState(val.getColValsSize() == 1); in EvalExprWithoutRow()
> 
>
> Key: IMPALA-6617
> URL: https://issues.apache.org/jira/browse/IMPALA-6617
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.0, Impala 2.12.0
>Reporter: Michael Ho
>Assignee: Alexander Behm
>Priority: Blocker
>  Labels: broken-build
>
> Hit this when running an exhaustive build at this 
> [commit|https://github.com/apache/impala/commit/5f2f445e7d29ed26f6818b5c41edda2fe7c49b59].
>  KRPC was disabled.
> {noformat}
> query_test/test_queries.py:111: in test_subquery
> self.run_test_case('QueryTest/subquery', vector)
> common/impala_test_suite.py:397: in run_test_case
> result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:612: in __execute_query
> return impalad_client.execute(query, user=user)
> common/impala_connection.py:160: in execute
> return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:173: in execute
> handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:339: in __execute_query
> handle = self.execute_query_async(query_string, user=user)
> beeswax/impala_beeswax.py:335: in execute_query_async
> return self.__do_rpc(lambda: self.imp_service.query(query,))
> beeswax/impala_beeswax.py:460: in __do_rpc
> raise ImpalaBeeswaxException(self.__build_error_message(b), b)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> EINNER EXCEPTION: 
> EMESSAGE: IllegalStateException: null
> Standard Error
> -- executing against localhost:21000
> use functional_rc_def;
> SET disable_codegen_rows_threshold=0;
> SET disable_codegen=False;
> SET abort_on_error=1;
> SET exec_single_node_rows_threshold=100;
> SET batch_size=0;
> SET num_nodes=0;
> -- executing against localhost:21000
> select a.id, a.int_col, a.string_col
> from functional.alltypessmall a
> where a.id in (select id from functional.alltypestiny where bool_col = false)
> and a.id < 5;
> {noformat}
> {noformat}
> I0306 10:59:28.357015 15877 Frontend.java:952] Analyzing query: select a.id, 
> a.int_col, a.string_col
> from functional.alltypessmall a
> where a.id in (select id from functional.alltypestiny where bool_col = false)
> and a.id < 5
> I0306 10:59:28.358779 15877 Frontend.java:964] Analysis finished.
> I0306 10:59:28.603314 27013 data-stream-mgr.cc:236] Reduced stream ID cache 
> from 1891 items, to 1887, eviction took: 0
> I0306 10:59:29.028396 15877 jni-util.cc:230] java.lang.IllegalStateException
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> at 
> org.apache.impala.service.FeSupport.EvalExprWithoutRow(FeSupport.java:169)
> at 
> org.apache.impala.service.FeSupport.EvalPredicate(FeSupport.java:218)
> at 
> org.apache.impala.analysis.Analyzer.isTrueWithNullSlots(Analyzer.java:1917)
> at 
> org.apache.impala.planner.HdfsScanNode.addDictionaryFilter(HdfsScanNode.java:659)
> at 
> org.apache.impala.planner.HdfsScanNode.computeDictionaryFilterConjuncts(HdfsScanNode.java:685)
> at org.apache.impala.planner.HdfsScanNode.init(HdfsScanNode.java:329)
> at 
> org.apache.impala.planner.SingleNodePlanner.createHdfsScanPlan(SingleNodePlanner.java:1255)
> at 
> org.apache.impala.planner.SingleNodePlanner.createScanNode(SingleNodePlanner.java:1298)
> at 
> org.apache.impala.planner.SingleNodePlanner.createTableRefNode(SingleNodePlanner.java:1506)
> at 
> org.apache.impala.planner.SingleNodePlanner.createTableRefsPlan(SingleNodePlanner.java:776)
> at 
> org.apache.impala.planner.SingleNodePlanner.createSelectPlan(SingleNodePlanner.java:614)
> at 
> org.apache.impala.planner.SingleNodePlanner.createQueryPlan(SingleNodePlanner.java:257)
> at 
> org.apache.impala.planner.SingleNodePlanner.createSingleNodePlan(SingleNodePlanner.java:147)
> at org.apache.impala.planner.Planner.createPlan(Planner.java:101)
> at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:896)
> at 
> org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1017)
> at 
> org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:156)
> I0306 10:59:30.051102 15877 status.cc:125] IllegalStateException: null
> @  0x1676fad  impala::Status::Status()
>   

[jira] [Assigned] (IMPALA-6994) Avoid reloading a table's HMS data for file-only operations

2018-05-09 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm reassigned IMPALA-6994:
--

Assignee: Pranay Singh

> Avoid reloading a table's HMS data for file-only operations
> ---
>
> Key: IMPALA-6994
> URL: https://issues.apache.org/jira/browse/IMPALA-6994
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Affects Versions: Impala 2.12.0
>Reporter: Balazs Jeszenszky
>Assignee: Pranay Singh
>Priority: Major
>
> Reloading file metadata for HDFS tables (e.g. as a final step in an 'insert') 
> is done via
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L628
> , which calls
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1243
> HdfsTable.load has no option to only load file metadata. HMS metadata will 
> also be reloaded every time, which is an unnecessary overhead (and potential 
> point of failure) when adding files to existing locations.






[jira] [Updated] (IMPALA-7001) Privilege inconsistency between SHOW TABLES and SHOW FUNCTIONS

2018-05-09 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-7001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm updated IMPALA-7001:
---
Labels: security  (was: )

> Privilege inconsistency between SHOW TABLES and SHOW FUNCTIONS
> --
>
> Key: IMPALA-7001
> URL: https://issues.apache.org/jira/browse/IMPALA-7001
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.10.0, Impala 2.11.0, Impala 2.12.0
>Reporter: Fredy Wijaya
>Priority: Major
>  Labels: security
>
>  
> {noformat}
> > grant create on database functional to role;
> > show tables in functional; -- this is allowed
> > show functions in functional;
> ERROR: AuthorizationException: User 'impdev' does not have privileges to 
> access: functional
> {noformat}
> In "show tables", we use ANY privilege whereas we use VIEW_METADATA in "show 
> functions".
>  






[jira] [Updated] (IMPALA-6994) Avoid reloading a table's HMS data for file-only operations

2018-05-08 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm updated IMPALA-6994:
---
Component/s: (was: Frontend)
 Catalog

> Avoid reloading a table's HMS data for file-only operations
> ---
>
> Key: IMPALA-6994
> URL: https://issues.apache.org/jira/browse/IMPALA-6994
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Affects Versions: Impala 2.12.0
>Reporter: Balazs Jeszenszky
>Priority: Major
>
> Reloading file metadata for HDFS tables (e.g. as a final step in an 'insert') 
> is done via
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L628
> , which calls
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1243
> HdfsTable.load has no option to only load file metadata. HMS metadata will 
> also be reloaded every time, which is an unnecessary overhead (and potential 
> point of failure) when adding files to existing locations.






[jira] [Commented] (IMPALA-5990) End-to-end compression of metadata

2018-05-08 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/IMPALA-5990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467660#comment-16467660
 ] 

Alexander Behm commented on IMPALA-5990:


[~arodoni_cloudera], this is an improvement of an existing configuration 
"--compact_catalog_topic", so it is not a new user-facing feature. That said, 
it would be nice to briefly mention it in the release notes as an improvement 
to metadata handling.

> End-to-end compression of metadata
> --
>
> Key: IMPALA-5990
> URL: https://issues.apache.org/jira/browse/IMPALA-5990
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog, Frontend
>Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0
>Reporter: Alexander Behm
>Assignee: Tianyi Wang
>Priority: Critical
> Fix For: Impala 2.12.0
>
>
> The metadata of large tables can become quite big making it costly to hold in 
> the statestore and disseminate to coordinator impalads. The metadata can even 
> get so big that fundamental limits like the JVM 2GB array size and the 
> Thrift 4GB limit are hit, leading to downtime.
> For reducing the statestore metadata topic size we have an existing 
> "compact_catalog_topic" flag which LZ4 compresses the metadata payload for 
> the C++ codepaths catalogd->statestore and statestore->impalad.
> Unfortunately, the metadata is not compressed in the same way during the 
> FE->BE transition on the catalogd and the BE->FE transition on the impalad.
> The goal of this change is to enable end-to-end compression for the full path 
> of metadata dissemination. The existing code paths also need significant 
> cleanup/streamlining. Ideally, the new code should provide consistent size 
> limits everywhere.
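The bandwidth/CPU trade-off described above can be illustrated with a short
Python sketch; zlib from the standard library stands in for LZ4 here, and
the payload string is invented for illustration (real topic entries carry
Thrift-serialized table metadata):

```python
import zlib

# Invented, repetitive metadata-like payload standing in for a catalog
# topic entry; real entries are Thrift-serialized table metadata.
payload = b"partition=2018-03-15/file_0000.parq;" * 10_000

compressed = zlib.compress(payload)

# Compression round-trips losslessly and shrinks the entry substantially,
# at the cost of some CPU on the compressing and decompressing ends.
assert zlib.decompress(compressed) == payload
ratio = len(compressed) / len(payload)
```

Highly repetitive metadata compresses especially well, which is why the
network savings usually dwarf the extra CPU cost.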






[jira] [Commented] (IMPALA-3063) NotImplementedException: ... RIGHT OUTER JOIN ... can only be executed with a single node plan

2018-05-02 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/IMPALA-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461827#comment-16461827
 ] 

Alexander Behm commented on IMPALA-3063:


Some variants of this bug are not fixed by this commit. See: IMPALA-5689

> NotImplementedException: ... RIGHT OUTER JOIN ... can only be executed with a 
> single node plan
> --
>
> Key: IMPALA-3063
> URL: https://issues.apache.org/jira/browse/IMPALA-3063
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.5.0, Impala 2.6.0
>Reporter: casey
>Assignee: Alexander Behm
>Priority: Blocker
>  Labels: regression
> Fix For: Impala 2.7.0
>
>
> Please see below.
> {noformat}
> Query: select 1
> FROM alltypes a1
> LEFT JOIN alltypes a2 ON a2.tinyint_col >= 1
> ERROR: NotImplementedException: Error generating a valid execution plan for 
> this query. A RIGHT OUTER JOIN type with no equi-join predicates can only be 
> executed with a single node plan.
> {noformat}
> The error message must be referencing some rewritten version of the query? 
> The original query has no RIGHT OUTER JOIN. 
> Also the message references "single node plan" which users may not understand.
> What's worse is if you "SET NUM_NODES=1" there is a dcheck. I'll file a 
> separate issue about that.






[jira] [Created] (IMPALA-6934) Wrong results with EXISTS subquery containing ORDER BY, LIMIT, and OFFSET

2018-04-25 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6934:
--

 Summary: Wrong results with EXISTS subquery containing ORDER BY, 
LIMIT, and OFFSET
 Key: IMPALA-6934
 URL: https://issues.apache.org/jira/browse/IMPALA-6934
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, 
Impala 2.7.0, Impala 2.6.0, Impala 2.5.0, Impala 2.12.0
Reporter: Alexander Behm


Queries may return wrong results if an EXISTS subquery has an ORDER BY with a 
LIMIT and OFFSET clause. The EXISTS subquery may incorrectly evaluate to TRUE 
even though it is FALSE.

Reproduction:
{code}
select count(*) from functional.alltypestiny t where
exists (select id from functional.alltypestiny where id < 5
   order by id limit 10 offset 6);
{code}
The query should return "0" but it incorrectly returns "8" because an incorrect 
plan without the offset is generated. See plan:
{code}
+-+
| Explain String  |
+-+
| Max Per-Host Resource Reservation: Memory=0B|
| Per-Host Resource Estimates: Memory=84.00MB |
| Codegen disabled by planner |
| |
| PLAN-ROOT SINK  |
| |   |
| 08:AGGREGATE [FINALIZE] |
| |  output: count:merge(*)   |
| |   |
| 07:EXCHANGE [UNPARTITIONED] |
| |   |
| 04:AGGREGATE|
| |  output: count(*) |
| |   |
| 03:NESTED LOOP JOIN [LEFT SEMI JOIN, BROADCAST] |
| |   |
| |--06:EXCHANGE [BROADCAST]  |
| |  ||
| |  05:MERGING-EXCHANGE [UNPARTITIONED]  |
| |  |  order by: id ASC  |
| |  |  limit: 1  |
| |  ||
| |  02:TOP-N [LIMIT=1]   |
| |  |  order by: id ASC  |
| |  ||
| |  01:SCAN HDFS [functional.alltypestiny]   |
| | partitions=4/4 files=4 size=460B  |
| | predicates: id < 5|
| |   |
| 00:SCAN HDFS [functional.alltypestiny t]|
|partitions=4/4 files=4 size=460B |
+-+
{code}

Evaluating the subquery by itself gives the expected results:
{code}
select id from functional.alltypestiny where id < 5 order by id limit 10 offset 
6;

{code}
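The expected semantics can be checked with a small Python simulation
(assuming, for illustration, that functional.alltypestiny holds ids 0
through 7):

```python
# Assumed stand-in data for functional.alltypestiny: 8 rows with ids 0-7.
rows = list(range(8))

# Subquery: SELECT id ... WHERE id < 5 ORDER BY id LIMIT 10 OFFSET 6
# Only 5 rows qualify, so an offset of 6 skips past all of them.
subquery = sorted(i for i in rows if i < 5)[6:6 + 10]

# EXISTS must therefore evaluate to FALSE, making the outer count(*) 0.
exists = len(subquery) > 0
expected = len(rows) if exists else 0
print(expected)  # 0
```

Dropping the OFFSET from the plan keeps all 5 subquery rows, flips EXISTS to
TRUE, and yields the incorrect result of 8.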





[jira] [Resolved] (IMPALA-6735) Inconsistent query submission, start, and end times in query profile

2018-04-04 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6735.

Resolution: Not A Bug

> Inconsistent query submission, start, and end times in query profile
> 
>
> Key: IMPALA-6735
> URL: https://issues.apache.org/jira/browse/IMPALA-6735
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.11.0
>Reporter: Alexander Behm
>Priority: Major
>  Labels: supportability
>
> We've sometimes observed inconsistencies in the following runtime profile 
> entries:
> * Query submitted (timeline event)
> * Start Time (info string, not a timeline event)
> * End Time (info string, not a timeline event)
> Here is one inconsistent example:
> {code}
> Query submitted at: 2018-03-22 10:27:57
> Start Time: 2018-03-22 10:28:05.883997000 
> End Time  : 2018-03-22 10:28:05.915566000
> {code}
> Based on the backend code it should not be possible that "Start Time" happens 
> after "Query submitted". The relevant code snippet is in impala-server.cc 
> ImpalaServer::ExecuteInternal():
> {code}
> ...
> // Sets the Start Time
>   request_state->reset(new ClientRequestState(query_ctx, exec_env_, 
> exec_env_->frontend(),
>   this, session_state));
> // Sets the query submitted time
>   (*request_state)->query_events()->MarkEvent("Query submitted");
> ...
> {code}
> One possible explanation could be that these events get the current time from 
> different functions:
> * The timeline events use our MonotonicStopWatch
> * The "Start Time" and "End Time" use UnixMicros() from our own time.h
> It's not clear that these produce consistent timings.





[jira] [Resolved] (IMPALA-6780) test_recover_paritions.py have always-true asserts

2018-04-03 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6780.

   Resolution: Fixed
Fix Version/s: Impala 2.13.0
   Impala 3.0

commit d478c492bd4e1bac4ab84d63eafb788940e49d53
Author: Alex Behm 
Date:   Mon Apr 2 10:56:00 2018 -0700

IMPALA-6780: Fix always-true asserts in test_recover_partitions.py

Fixes the following syntax issue leading to an always-true assert:
assert (cond, msg) <-- always true

Instead, this should be used:
assert cond, msg <-- cond and msg can optionally have parenthesis

Testing:
- Locally ran test_recover_partitions.py
- Searched for other tests with similar mistakes (none identified)

Change-Id: If38efa62c2496b69ae891bf916482d697bfc719f
Reviewed-on: http://gerrit.cloudera.org:8080/9886
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm 
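The pitfall fixed here is easy to reproduce in plain Python:

```python
# Always-true form: the operand is the tuple (condition, message), and a
# non-empty tuple is truthy, so this assertion can never fail -- even when
# the condition is False. CPython emits a SyntaxWarning for it.
def check_bad(values):
    assert (len(values) > 0, "values must not be empty")

# Correct form: condition, comma, message -- no surrounding parentheses.
def check_good(values):
    assert len(values) > 0, "values must not be empty"

check_bad([])  # passes silently despite the empty list
```

Calling check_good([]) raises AssertionError as intended, which is why the
patch rewrites every assert to the unparenthesized form.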


> test_recover_paritions.py have always-true asserts
> --
>
> Key: IMPALA-6780
> URL: https://issues.apache.org/jira/browse/IMPALA-6780
> Project: IMPALA
>  Issue Type: Task
>Reporter: Philip Zeyliger
>Assignee: Alexander Behm
>Priority: Major
> Fix For: Impala 3.0, Impala 2.13.0
>
>
> I discovered in the process of looking at IMPALA-6453 that we have some 
> assertions that the Python 2.6 compiler thinks are always true. This seems 
> like a true test bug.
> {code}
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:91: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (self.has_value(PART_NAME, result.data),
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:95: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (self.has_value(INSERTED_VALUE, result.data),
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:108: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (len(result.data) == old_length,
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:124: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (self.has_value("NULL", result.data),
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:255: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert ((old_length + 1) == len(result.data),
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:288: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (self.has_value(INSERTED_VALUE, result.data),
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:332: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (len(result.data) == (old_length + 1),
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:362: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (self.count_partition(result.data) == 1,
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:365: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (self.count_value('p=100%25', result.data) == 1,
> /tmp/zz/mnt/tests/metadata/test_recover_partitions.py:389: SyntaxWarning: 
> assertion is always true, perhaps remove parentheses?
>   assert (len(result.data) == old_length,
> {code}
> {code}
> >>> assert (False, "hey")
> :1: SyntaxWarning: assertion is always true, perhaps remove 
> parentheses?
> {code}





[jira] [Created] (IMPALA-6735) Inconsistent query submission, start, and end times in query profile

2018-03-24 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6735:
--

 Summary: Inconsistent query submission, start, and end times in 
query profile
 Key: IMPALA-6735
 URL: https://issues.apache.org/jira/browse/IMPALA-6735
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 2.11.0
Reporter: Alexander Behm


We've sometimes observed inconsistencies in the following runtime profile 
entries:
* Query submitted (timeline event)
* Start Time (info string, not a timeline event)
* End Time (info string, not a timeline event)

Here is one inconsistent example:
{code}
Query submitted at  : 2018-03-22 10:27:57
Start Time  : 2018-03-22 10:28:05.883997000 
End Time: 2018-03-22 10:28:05.915566000
{code}

Based on the backend code it should not be possible that "Start Time" happens 
after "Query submitted". The relevant code snippet is in impala-server.cc 
ImpalaServer::ExecuteInternal():
{code}
...
// Sets the Start Time
  request_state->reset(new ClientRequestState(query_ctx, exec_env_, 
exec_env_->frontend(),
  this, session_state));
// Sets the query submitted time
  (*request_state)->query_events()->MarkEvent("Query submitted");
...
{code}

One possible explanation could be that these events get the current time from 
different functions:
* The timeline events use our MonotonicStopWatch
* The "Start Time" and "End Time" use UnixMicros() from our own time.h

It's not clear that these produce consistent timings.
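The suspicion about mixing clock sources can be illustrated in Python, where
time.monotonic() and time.time() play the roles of MonotonicStopWatch and
UnixMicros() (an analogy, not Impala's actual code):

```python
import time

wall_start = time.time()       # wall clock, analogous to UnixMicros()
mono_start = time.monotonic()  # monotonic clock, analogous to MonotonicStopWatch

time.sleep(0.01)

wall_elapsed = time.time() - wall_start
mono_elapsed = time.monotonic() - mono_start

# Both measure roughly the same interval here, but the wall clock can be
# stepped (NTP adjustments, manual changes) while the monotonic clock cannot,
# so timestamps derived from the two sources are not guaranteed to order
# consistently with each other.
```

Deriving all profile timestamps from a single clock source would remove this
class of inconsistency.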





[jira] [Resolved] (IMPALA-6683) Restarting the Catalog without restarting Impalad and SS can block topic updates

2018-03-16 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6683.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit 0532ff97c80aad08e800fede78348653e57a872f
Author: Tianyi Wang 
Date:   Thu Mar 15 18:25:54 2018 -0700

IMPALA-6683: Fix infinite loop after restarting the catalog

Currently the catalog service ID topic item includes the ID string.
It causes the coexistence of multiple catalog service ID topic items
after the catalogd restarts. Impalad therefore keeps detecting the
change of catalog service ID and requests a full catalog update. This
patch uses "CATALOG_SERVICE_ID" as the topic item name instead.

With the patch impalad prints only one line of catalog change log after
the catalog restarts.

Change-Id: I1ee6c6477458e0f4dd31b12daa9ed5f146d84e7b
Reviewed-on: http://gerrit.cloudera.org:8080/9684
Reviewed-by: Alex Behm 
Reviewed-by: Dimitris Tsirogiannis 
Tested-by: Impala Public Jenkins
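The keying problem described in the commit message can be sketched with a
plain dict; the key scheme shown is hypothetical, not the actual statestore
topic format:

```python
topic = {}

# Hypothetical pre-fix scheme: the service ID is part of the topic item key,
# so entries from before and after a catalogd restart coexist in the topic.
def publish_before_fix(service_id):
    topic["catalog-service-id:" + service_id] = service_id

publish_before_fix("uuid-old")
publish_before_fix("uuid-new")
assert len(topic) == 2  # two IDs visible -> impalad keeps seeing a "change"

# Post-fix scheme: a fixed key, so a restart simply overwrites the old entry.
topic.clear()
def publish_after_fix(service_id):
    topic["CATALOG_SERVICE_ID"] = service_id

publish_after_fix("uuid-old")
publish_after_fix("uuid-new")
assert len(topic) == 1  # only the current service ID remains
```

With a single entry, impalad detects the ID change exactly once instead of
looping on full catalog updates.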


> Restarting the Catalog without restarting Impalad and SS can block topic 
> updates
> 
>
> Key: IMPALA-6683
> URL: https://issues.apache.org/jira/browse/IMPALA-6683
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Affects Versions: Impala 2.12.0
>Reporter: Mostafa Mokhtar
>Assignee: Tianyi Wang
>Priority: Blocker
> Fix For: Impala 2.12.0
>
>
> After restarting the Catalog without restarting the other services topic 
> updates are getting stuck
> From SS
> {code}
> I0315 17:20:44.305951 120990 statestore.cc:242] Preparing initial 
> catalog-update topic update for impa...@va1031.foo:22000. Size = 50.18 MB
> I0315 17:20:44.305981 120974 statestore.cc:242] Preparing initial 
> catalog-update topic update for impa...@vd1120.foo:22000. Size = 50.18 MB
> I0315 17:20:44.305982 120975 statestore.cc:242] Preparing initial 
> catalog-update topic update for impa...@vc1518.foo:22000. Size = 50.18 MB
> I0315 17:20:44.343500 121001 statestore.cc:652] Received request for 
> different delta base of topic: catalog-update from: impa...@vd1107.foo:22000 
> subscriber from_version: 0
> I0315 17:20:44.355831 120963 statestore.cc:242] Preparing initial 
> catalog-update topic update for impa...@ve1134.foo:22000. Size = 50.18 MB
> I0315 17:20:44.355840 120983 statestore.cc:242] Preparing initial 
> catalog-update topic update for impa...@vd1329.foo:22000. Size = 50.18 MB
> {code}
> From Impalad
> {code}
> E0315 17:20:36.468525  7355 impala-server.cc:1381] There was an error 
> processing the impalad catalog update. Requesting a full topic update to 
> recover: CatalogException: Detected catalog service ID change. Aborting 
> updateCatalog()
> E0315 17:20:40.778920  7355 impala-server.cc:1381] There was an error 
> processing the impalad catalog update. Requesting a full topic update to 
> recover: CatalogException: Detected catalog service ID change. Aborting 
> updateCatalog()
> I0315 17:20:42.951370 11846 StmtMetadataLoader.java:193] Waiting for table 
> metadata. Waited for 170 catalog updates and 340149ms. Tables remaining: 
> [metadata_benchmarks.80_partitions_250k_files]
> I0315 17:21:04.954602 11846 StmtMetadataLoader.java:214] Re-sending 
> prioritized load request. Waited for 180 catalog updates and 362152ms.
> I0315 17:21:04.955175 11846 FeSupport.java:274] Requesting prioritized load 
> of table(s): metadata_benchmarks.80_partitions_250k_files
> {code}





[jira] [Resolved] (IMPALA-5270) Crash with ORDER BY in OVER clause with RANDOM

2018-03-15 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-5270.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit 42abe8139ebb16f3284328c8f028b47ae86ab78a
Author: Alex Behm 
Date:   Tue Mar 13 10:26:55 2018 -0700

IMPALA-5270: Pass resolved exprs into analytic SortInfo.

The bug was that the SortInfo of analytics was given
ordering exprs that were not fully resolved against their
input (e.g. inline views were not resolved).
As a result, the SortInfo logic did not materialize exprs
like rand() coming from inline views.

The fix is to pass fully resolved exprs to the analytic
SortInfo, and then the existing materialization logic
properly handles non-deterministic built-ins and UDFs.

The code around sort generation was rather convoluted
and difficult to understand. I overhauled SortInfo to
unify the different uses of it under a common codepath.
After that cleanup, the fix for this issue was trivial.

Testing:
- Locally ran planner tests
- Locally ran analytic EE tests in test_queries.py
- Core/hdfs run passed

Change-Id: Id2b3f4e5e3f1fd441a63160db3c703c432fbb072
Reviewed-on: http://gerrit.cloudera.org:8080/9631
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> Crash with ORDER BY in OVER clause with RANDOM
> --
>
> Key: IMPALA-5270
> URL: https://issues.apache.org/jira/browse/IMPALA-5270
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, 
> Impala 2.10.0, Impala 2.11.0
>Reporter: Thomas Tauber-Marshall
>Assignee: Alexander Behm
>Priority: Critical
>  Labels: crash, performance
> Fix For: Impala 2.12.0
>
>
> A recent change IMPALA-4728 added materialization of sort expressions for 
> performance and to solve an issue with sorting on non-deterministic 
> expressions.
> However, this change doesn't materialize sort exprs when you have an inline 
> view combined with an analytic function.
> Example:
> {code}
> select id, r, last_value(r) over (order by r) from
>   (select id, random() r from functional.alltypestiny) x
> {code}
> A query like the above can lead to a crash just like in IMPALA-4731.





[jira] [Created] (IMPALA-6675) Change default configuration to --compact_catalog_topic=true

2018-03-15 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6675:
--

 Summary: Change default configuration to 
--compact_catalog_topic=true
 Key: IMPALA-6675
 URL: https://issues.apache.org/jira/browse/IMPALA-6675
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Affects Versions: Impala 2.11.0
Reporter: Alexander Behm
Assignee: Alexander Behm


The catalog metadata can become large and lead to excessive network traffic due 
to dissemination via the statestore. The --compact_catalog_topic flag was 
introduced to mitigate this issue by compressing the catalog topic entries to 
reduce their serialized size.
This saves network bandwidth at the cost of a small quantity of CPU time.

To improve the out-of-the box experience of users we should enable this flag by 
default.





[jira] [Resolved] (IMPALA-6639) Crash with 'ORDER BY' in 'OVER' clause with 'RANDOM'

2018-03-12 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6639.

Resolution: Duplicate

> Crash with 'ORDER BY' in 'OVER' clause with 'RANDOM'
> 
>
> Key: IMPALA-6639
> URL: https://issues.apache.org/jira/browse/IMPALA-6639
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 3.0, Impala 2.12.0
>Reporter: Balazs Jeszenszky
>Priority: Blocker
>  Labels: crash
>
> The following query crashes Impala reliably:
> {code:java}
> select AVG(n) OVER(ORDER By n) from (
> select RANDOM() as n from (select 1 union all select 1) a) b;
> {code}
> Stack trace:
> {code:java}
> #0  0x7f98565315e5 in raise () from /lib64/libc.so.6
> #1  0x7f9856532dc5 in abort () from /lib64/libc.so.6
> #2  0x7f9858697a55 in os::abort(bool) () from 
> /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
> #3  0x7f9858817f87 in VMError::report_and_die() ()
>from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
> #4  0x7f985869c96f in JVM_handle_linux_signal ()
>from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
> #5  
> #6  0x02c9ed73 in impala::Sorter::Run::Run (this=0x94612a0, 
> parent=0xa396080, sort_tuple_desc=0x8d9a750, 
> initial_run=true) at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/runtime/sorter.cc:624
> #7  0x02ca5c00 in impala::Sorter::Open (this=0xa396080)
> at /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/runtime/sorter.cc:1551
> #8  0x02901a09 in impala::SortNode::Open (this=0x945de00, 
> state=0x9d92180)
> at /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/exec/sort-node.cc:82
> #9  0x02919bcb in impala::AnalyticEvalNode::Open (this=0x6dbb100, 
> state=0x9d92180)
> at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/exec/analytic-eval-node.cc:187
> #10 0x01893d03 in impala::FragmentInstanceState::Open (this=0xae64760)
> at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/runtime/fragment-instance-state.cc:255
> #11 0x018917fd in impala::FragmentInstanceState::Exec (this=0xae64760)
> at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/runtime/fragment-instance-state.cc:80
> #12 0x0187a7ac in impala::QueryState::ExecFInstance (this=0x8fd0d00, 
> fis=0xae64760)
> at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/runtime/query-state.cc:382
> #13 0x0187906e in impala::QueryState::::operator()(void) 
> const (__closure=0x7f97ffd88bc8)
> at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/runtime/query-state.cc:325
> #14 0x0187b3eb in 
> boost::detail::function::void_function_obj_invoker0,
>  void>::invoke(boost::detail::function::function_buffer &) 
> (function_obj_ptr=...)
> at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:153
> #15 0x017c88de in boost::function0::operator() 
> (this=0x7f97ffd88bc0)
> at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:767
> #16 0x01abe113 in impala::Thread::SuperviseThread (name=..., 
> category=..., functor=..., 
> thread_started=0x7f9800788ab0) at 
> /usr/src/debug/impala-2.11.0-cdh5.14.0/be/src/util/thread.cc:352
> #17 0x01ac6c9e in 
> boost::_bi::list4 std::char_traits, std::allocator > >, 
> boost::_bi::value std::allocator > >, boost::_bi::value >, 
> boost::_bi::value >::operator() std::basic_string&, const std::basic_string&, 
> boost::function, impala::Promise*), 
> boost::_bi::list0>(boost::_bi::type, void (*&)(const 
> std::basic_string &, 
> const std::basic_string 
> &, boost::function, impala::Promise *), boost::_bi::list0 &, 
> int) (this=0xa3901c0, f=@0xa3901b8, a=...)
> {code}
> sort_tuple_size_ ends up as zero in this division:
> https://github.com/cloudera/Impala/blob/cdh5-2.11.0_5.14.0/be/src/runtime/sorter.cc#L624
> Looks like the tuple descriptor is malformed:
> {code:java}
> (gdb) p  *(impala::TupleDescriptor *) sort_tuple_desc
> $4 = {static LLVM_CLASS_NAME = 0x3d649c4 "class.impala::TupleDescriptor", id_ 
> = 8, table_desc_ = 0x0, 
>   byte_size_ = 0, num_null_bytes_ = 0, null_bytes_offset_ = 0, 
>   slots_ = { std::allocator >> = {
>   _M_impl = {> = 
> {<__gnu_cxx::new_allocator> = {}, 
> 

[jira] [Created] (IMPALA-6628) Use unqualified table references in .test files run from test_queries.py

2018-03-07 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6628:
--

 Summary: Use unqualified table references in .test files run from 
test_queries.py
 Key: IMPALA-6628
 URL: https://issues.apache.org/jira/browse/IMPALA-6628
 Project: IMPALA
  Issue Type: Improvement
  Components: Infrastructure
Reporter: Alexander Behm


To increase our test coverage over different file formats we should go through 
the .test files referenced from test_queries.py and switch to using unqualified 
table references where possible.

The state today is that in the exhaustive exploration strategy we run every 
.test file once for every file format. However, since many .test files use 
fully-qualified table references, we are not actually getting coverage over all 
formats; we spend the time to run the tests without getting the coverage we'd 
like.

I skimmed a few files and identified that at least these could be improved:
analytic-fns.test
subquery.test
limit.test
top-n.test

Likely there are more .test files. Probably there are similar issues in 
different .py files as well, but to keep this JIRA focused I propose we focus 
on test_queries.py first.

*What to do*
* Go through the .test files and change fully-qualified table references to 
unqualified table references where possible. Our test framework issues a "use 


[jira] [Created] (IMPALA-6627) Document Hive-incompatible behavior with the serialization.null.format table property

2018-03-07 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6627:
--

 Summary: Document Hive-incompatible behavior with the 
serialization.null.format table property
 Key: IMPALA-6627
 URL: https://issues.apache.org/jira/browse/IMPALA-6627
 Project: IMPALA
  Issue Type: Improvement
  Components: Docs
Reporter: Alexander Behm
Assignee: Alex Rodoni


Impala only respects the "serialization.null.format" table property for TEXT 
tables and ignores it for Parquet and other formats.

Hive respects the "serialization.null.format" property even for other formats, 
converting matching values to NULL during the scan.

There is a separate discussion to be had about which behavior makes more 
sense, but let's document this as an incompatibility for now since it has come 
up several times already.





[jira] [Created] (IMPALA-6626) Failure to assign dictionary predicates should not result in query failure

2018-03-07 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6626:
--

 Summary: Failure to assign dictionary predicates should not result 
in query failure
 Key: IMPALA-6626
 URL: https://issues.apache.org/jira/browse/IMPALA-6626
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0
Reporter: Alexander Behm


Assigning dictionary predicates to Parquet scans may involve evaluation of 
expressions in the BE which could fail for various reasons. Such failures 
should lead to non-assignment of dictionary predicates but not to query failure.

See HdfsScanNode:
{code}
private void addDictionaryFilter(...) {
  ...
  try {
    if (analyzer.isTrueWithNullSlots(conjunct)) return;
  } catch (InternalException e) { <--- only catches InternalException; any other
                                       exception will cause the query to fail
    // Expr evaluation failed in the backend. Skip this conjunct since we cannot
    // determine whether it is safe to apply it against a dictionary.
    LOG.warn("Skipping dictionary filter because backend evaluation failed: "
        + conjunct.toSql(), e);
    return;
  }
}
{code}
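A minimal, self-contained sketch of the suggested direction (hypothetical names and stubs, not the committed Impala patch): widen the catch so that any evaluation failure skips the dictionary filter instead of failing the query.

```java
public class Main {
    // Stand-in for backend expression evaluation, which may fail unexpectedly.
    static boolean isTrueWithNullSlots(String conjunct) {
        throw new IllegalStateException("backend evaluation failed");
    }

    // Returns true if the conjunct was kept as a dictionary filter.
    static boolean addDictionaryFilter(String conjunct) {
        try {
            if (isTrueWithNullSlots(conjunct)) return false;
        } catch (Exception e) { // broad catch: skip the filter, don't fail the query
            System.out.println("Skipping dictionary filter: " + conjunct);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // The evaluation failure is absorbed; the query would proceed
        // without this dictionary filter.
        boolean added = addDictionaryFilter("c1 = 10");
        System.out.println("added=" + added);
    }
}
```

The key point is that the failure mode degrades to "no dictionary filter" rather than "query error".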





[jira] [Created] (IMPALA-6625) Skip dictionary and collection conjunct assignment for non-Parquet scans.

2018-03-07 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6625:
--

 Summary: Skip dictionary and collection conjunct assignment for 
non-Parquet scans.
 Key: IMPALA-6625
 URL: https://issues.apache.org/jira/browse/IMPALA-6625
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0
Reporter: Alexander Behm


In HdfsScanNode.init() we try to assign dictionary and collection conjuncts 
even for non-Parquet scans. Such predicates only make sense for Parquet scans, 
so there is no point in collecting them for other scans.

The current behavior is undesirable because:
* init() can be substantially slower because assigning dictionary filters may 
involve evaluating exprs in the BE which can be expensive
* the explain plan of non-Parquet scans may have a section "parquet dictionary 
predicates" which is confusing/misleading

Relevant code snippet from HdfsScanNode:
{code}
@Override
public void init(Analyzer analyzer) throws ImpalaException {
  conjuncts_ = orderConjunctsByCost(conjuncts_);
  checkForSupportedFileFormats();

  assignCollectionConjuncts(analyzer);
  computeDictionaryFilterConjuncts(analyzer);

  // compute scan range locations with optional sampling
  Set<HdfsFileFormat> fileFormats = computeScanRangeLocations(analyzer);
  ...
  if (fileFormats.contains(HdfsFileFormat.PARQUET)) { <--- assignment should go in here
    computeMinMaxTupleAndConjuncts(analyzer);
  }
  ...
}
{code}
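The proposed restructuring can be illustrated with a toy model (hypothetical names, not Impala's actual classes): Parquet-only conjunct assignment is gated on the file formats actually present in the scan, so non-Parquet scans skip the expensive work entirely.

```java
import java.util.EnumSet;
import java.util.Set;

public class Main {
    enum HdfsFileFormat { TEXT, PARQUET, AVRO }

    static boolean computedDictionaryFilters = false;

    // Simplified init(): only do Parquet-specific work when a Parquet
    // file is actually part of the scan.
    static void init(Set<HdfsFileFormat> fileFormats) {
        if (fileFormats.contains(HdfsFileFormat.PARQUET)) {
            computedDictionaryFilters = true; // stands in for the expensive assignment
        }
    }

    public static void main(String[] args) {
        init(EnumSet.of(HdfsFileFormat.TEXT));
        System.out.println("text-only scan computes filters: " + computedDictionaryFilters);
        init(EnumSet.of(HdfsFileFormat.PARQUET));
        System.out.println("parquet scan computes filters: " + computedDictionaryFilters);
    }
}
```

This also keeps the "parquet dictionary predicates" explain section out of non-Parquet plans.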






[jira] [Resolved] (IMPALA-6586) FrontendTest.TestGetTablesTypeTable failing on some builds

2018-02-27 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6586.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit 5464c09786b6207507d8ebdbc1f0275f007c08e3
Author: Alex Behm 
Date:   Mon Feb 26 12:02:01 2018 -0800

IMPALA-6586: Fix bug in TestGetTablesTypeTable()

The bug in FrontendTest.TestGetTablesTypeTable() was
that it did not explicitly load views that the test
assumed to be loaded already. The test needs to
distinguish between views and tables and views need
to be loaded for them to be discernable from tables.

I was able to reproduce the issue locally by just
running FrontendTest.TestGetTablesTypeTable() without
any other test.

Testing:
- locally ran all tests in FrontendTest individually
  (with a fresh ImpaladTestCatalog)

Change-Id: Idf0bddb2e29209adda5bda5ddc428f46f241c8c9
Reviewed-on: http://gerrit.cloudera.org:8080/9453
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> FrontendTest.TestGetTablesTypeTable failing on some builds
> --
>
> Key: IMPALA-6586
> URL: https://issues.apache.org/jira/browse/IMPALA-6586
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Tim Armstrong
>Assignee: Alexander Behm
>Priority: Blocker
>  Labels: broken-build
> Fix For: Impala 2.12.0
>
>
> {noformat}
> org.apache.impala.service.FrontendTest.TestGetTablesTypeTable
> Failing for the past 1 build (Since Failed#45 )
> Took 9 ms.
> add description
> Error Message
> expected:<1> but was:<2>
> Stacktrace
> java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.impala.service.FrontendTest.TestGetTablesTypeTable(FrontendTest.java:117)
> {noformat}





[jira] [Created] (IMPALA-6590) Disable expr rewrites for VALUES() statements

2018-02-26 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6590:
--

 Summary: Disable expr rewrites for VALUES() statements
 Key: IMPALA-6590
 URL: https://issues.apache.org/jira/browse/IMPALA-6590
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0
Reporter: Alexander Behm


The analysis of statements with big VALUES clauses like INSERT INTO <table> 
VALUES is slow due to expression rewrites like constant folding. The 
performance of such statements has regressed since the introduction of expr 
rewrites and constant folding in IMPALA-1788.

We should skip expr rewrites for VALUES altogether since it mostly provides no 
benefit but can have a large overhead due to evaluation of expressions in the 
backend (constant folding). These expressions are ultimately evaluated and 
materialized in the backend anyway, so there's no point in folding them during 
analysis.

*Workaround*
{code}
SET ENABLE_EXPR_REWRITES=FALSE;
{code}





[jira] [Resolved] (IMPALA-6567) Functional dataload is intermittently super-slow

2018-02-23 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6567.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit ad91e0b04cedb84b5b08c810de4ab1aef036
Author: Alex Behm 
Date:   Thu Feb 22 21:07:27 2018 -0800

IMPALA-6567: ResetMetadataStmt analysis should not load tables.

This fixes a regression introduced by IMPALA-5152 where
invalidate metadata <table> and refresh <table> accidentally
required the target table to be loaded during analysis,
ultimately leading to a double load in some situations
(load during analysis, then another load during execution).
Since the purpose of these statements is to reload
metadata it does not make sense to require a table load
during analysis - that load happens during execution.

Note that REFRESH <table> PARTITION (<partition>) still
requires the containing table to be loaded. This was
the behavior before IMPALA-5152 and this patch does
not attempt to improve that.

Testing:
- added new unit test
- ran FE tests locally
- validated the desired behavior by inspecting logs
  and the timeline from invalidate/refresh statements

Change-Id: I7033781ebf27ea53cfd26ff0e4f74d4f242bd1dc
Reviewed-on: http://gerrit.cloudera.org:8080/9418
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm 


> Functional dataload is intermittently super-slow
> 
>
> Key: IMPALA-6567
> URL: https://issues.apache.org/jira/browse/IMPALA-6567
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.12.0
>Reporter: Joe McDonnell
>Assignee: Alexander Behm
>Priority: Blocker
> Fix For: Impala 2.12.0
>
>
> Recent GVO builds intermittently have a functional dataload of almost 2 hours 
> when it used to be ~30-35 minutes:
> {noformat}
> 02:12:15 Loading TPC-DS data (logging to 
> /home/ubuntu/Impala/logs/data_loading/load-tpcds.log)...
> 02:34:27 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 
> 22 min 12 sec)
> 02:34:35 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 
> 22 min 20 sec)
> 04:11:40 Loading workload 'functional-query' using exploration strategy 
> 'exhaustive' OK (Took: 119 min 25 sec)
> {noformat}
>  
> This has happened on multiple runs (including some in progress):
> [https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/1370/]
> [https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/1382/]
> [https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/1383/] (missing some 
> logs due to abort)
> [https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/1384/] (in progress)
> [https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/1385/] (in progress)
>  
> Dataload creates a SQL script that invalidates each table created using an 
> "invalidate metadata ${tablename}" command. There are 830 "invalidate 
> metadata ${tablename}" calls in the invocation of this script (see 
> IMPALA-6386 for why we do invalidate at the table level). Even so, this 
> script should execute very quickly.
> The impalad.INFO from the 1370 run shows that this script is taking a long 
> time. The first invalidate metadata for functional tables is at 2:41 and the 
> last invalidate metadata for this run of the invalidate script is at 3:17. 
> The invalidate script runs twice. The second run begins at 3:19 and finishes 
> at 4:11. 
>  





[jira] [Created] (IMPALA-6575) Avoid double-counting of predicates in join cardinality estimation

2018-02-23 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6575:
--

 Summary: Avoid double-counting of predicates in join cardinality 
estimation
 Key: IMPALA-6575
 URL: https://issues.apache.org/jira/browse/IMPALA-6575
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, 
Impala 2.7.0, Impala 2.6.0, Impala 2.5.0
Reporter: Alexander Behm


The cardinality of an inner join may be significantly underestimated if (1) an 
equivalent predicate exists on both join inputs, (2) the join condition 
involves the same column as that predicate, and (3) Impala believes the join to 
be FK/PK.

The reason for this underestimation is that the planner double-counts the 
selectivity of predicates on the join input:
* First, the selectivity reduces the cardinality of the join input.
* Second, since the join is FK/PK, the build-side selectivity is applied to the 
join cardinality.

This second adjustment is not correct in this specific situation because the 
predicate selectivity has already been applied to the probe-side join input.
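The double-counting can be reproduced with simple arithmetic. The numbers below are assumptions for illustration: a 7300-row table and a 0.1 default selectivity for "id < 10", which match the 73-vs-730 cardinalities in the plan of the example that follows.

```java
public class Main {
    public static void main(String[] args) {
        double tableRows = 7300; // assumed row count of functional.alltypes
        double sel = 0.1;        // assumed default selectivity of "id < 10"

        // The predicate is applied once when sizing the probe input:
        double probe = tableRows * sel; // 730 rows

        // The FK/PK adjustment then applies the build-side selectivity again:
        double buggy = probe * sel;     // 73: selectivity counted twice
        double expected = probe;        // 730: predicate counted only once

        System.out.println("buggy=" + Math.round(buggy)
            + " expected=" + Math.round(expected));
    }
}
```

The fix sketched in this issue is to recognize when the build-side predicate is already reflected in the probe cardinality and skip the second adjustment.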

Example:
{code}
explain select count(*) from functional.alltypes a join functional.alltypes b 
on a.id = b.id and a.id < 10 and b.id < 10;
++
| Explain String
 |
++
| Max Per-Host Resource Reservation: Memory=4.00MB  
 |
| Per-Host Resource Estimates: Memory=279.94MB  
 |
| Codegen disabled by planner   
 |
|   
 |
| F03:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 
 |
| |  Per-Host Resources: mem-estimate=10.00MB mem-reservation=0B
 |
| PLAN-ROOT SINK
 |
| |  mem-estimate=0B mem-reservation=0B 
 |
| | 
 |
| 07:AGGREGATE [FINALIZE]   
 |
| |  output: count:merge(*) 
 |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
 |
| |  tuple-ids=2 row-size=8B cardinality=1  
 |
| | 
 |
| 06:EXCHANGE [UNPARTITIONED]   
 |
| |  mem-estimate=0B mem-reservation=0B 
 |
| |  tuple-ids=2 row-size=8B cardinality=1  
 |
| | 
 |
| F02:PLAN FRAGMENT [HASH(a.id)] hosts=3 instances=3
 |
| Per-Host Resources: mem-estimate=12.94MB mem-reservation=2.94MB 
runtime-filters-memory=1.00MB  |
| 03:AGGREGATE  
 |
| |  output: count(*)   
 |
| |  mem-estimate=10.00MB mem-reservation=0B spill-buffer=2.00MB
 |
| |  tuple-ids=2 row-size=8B cardinality=1  
 |
| | 
 |
| 02:HASH JOIN [INNER JOIN, PARTITIONED]
 |
| |  hash predicates: a.id = b.id   
 |
| |  fk/pk conjuncts: a.id = b.id   
 |
| |  runtime filters: RF000[bloom] <- b.id  
 |
| |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
 |
| |  tuple-ids=0,1 row-size=8B cardinality=73   <--- should be 730  
   |
| | 
 |
| |--05:EXCHANGE [HASH(b.id)]   
 |
| |  |  mem-estimate=0B 

[jira] [Created] (IMPALA-6568) DDL statements and possibly others do not contain the Query Compilation timeline

2018-02-22 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6568:
--

 Summary: DDL statements and possibly others do not contain the 
Query Compilation timeline
 Key: IMPALA-6568
 URL: https://issues.apache.org/jira/browse/IMPALA-6568
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0
Reporter: Alexander Behm


Some statements do not seem to include the "Query Compilation" timeline in the 
query profile.

Repro:
{code}
create table t (i int);
describe t; <-- loads the table, but no FE timeline in profile
invalidate metadata t;
alter table t set tblproperties('numRows'='10'); <-- loads the table, but no FE 
timeline in profile
{code}





[jira] [Resolved] (IMPALA-5152) Frontend requests metadata for one table at a time in the query

2018-02-21 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-5152.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit 8ea1ce87e2150c843b4da15f9d42b87006e6ffca
Author: Alex Behm 
Date:   Fri Apr 7 09:58:40 2017 -0700

IMPALA-5152: Introduce metadata loading phase

Reworks the collection and loading of missing metadata
when compiling a statement. Introduces a new
metadata-loading phase between parsing and analysis.
Summary of the new compilation flow:
1. Parse statement.
2. Collect all table references from the parsed
   statement and generate a list of tables that need
   to be loaded for analysis to succeed.
3. Request missing metadata and wait for it to arrive.
   As views become loaded we expand the set of required
   tables based on the view definitions.
   This step populates a statement-local table cache
   that contains all loaded tables relevant to the
   statement.
4. Create a new Analyzer with the table cache and
   analyze the statement. During analysis only the
   table cache is consulted for table metadata, the
   ImpaladCatalog is not used for that purpose anymore.
5. Authorize the statement.
6. Plan generation as usual.
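The loading phase in steps 2-4 above can be sketched as a small toy model (simplified, hypothetical names; the real implementation also expands the required set as view definitions become loaded):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class Main {
    // Stand-in for the catalog service holding table metadata.
    static Map<String, String> catalog = new HashMap<>();

    public static void main(String[] args) {
        catalog.put("a.b", "TABLE");
        catalog.put("x.y", "VIEW: SELECT * FROM a.b");

        // Steps 2-3: collect all table refs from the parsed statement and
        // load them into a statement-local cache before analysis starts.
        Deque<String> toLoad = new ArrayDeque<>();
        toLoad.add("a.b");
        toLoad.add("x.y");

        Map<String, String> stmtTableCache = new HashMap<>();
        while (!toLoad.isEmpty()) {
            String ref = toLoad.poll();
            stmtTableCache.put(ref, catalog.get(ref)); // request missing metadata
        }

        // Step 4: analysis consults only the statement-local cache,
        // never the shared catalog.
        System.out.println("loaded=" + new TreeSet<>(stmtTableCache.keySet()));
    }
}
```

Because all required tables are loaded together up front, the one-by-one loading described below cannot occur.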

The intent of the existing code was to collect all tables
missing metadata during analysis, load the metadata, and then
re-analyze the statement (and repeat those steps until all
metadata is loaded).
Unfortunately, the relevant code was hard-to-follow, subtle
and not well tested, and therefore it was broken in several
ways over the course of time. For example, the introduction
of path analysis for nested types subtly broke the intended
behavior, and there are other similar examples.

The serial table loading observed in the JIRA was caused by the
following code in the resolution of table references:
for (all path interpretations) {
  try {
// Try to resolve the path; might call getTable() which
// throws for nonexistent tables.
  } catch (AnalysisException e) {
if (analyzer.hasMissingTbls()) throw e;
  }
}

The following example illustrates the problem:
SELECT * FROM a.b, x.y
When resolving the path "a.b" we consider that "a" could be a
database or a table. Similarly, "b" could be a table or a
nested collection.
If the path resolution for "a.b" adds a missing table entry,
then the path resolution for "x.y" could exit prematurely,
without trying the other path interpretations that would
lead to adding the expected missing table. So effectively,
the tables end up being loaded one-by-one.

Testing:
- A core/hdfs run succeeded
- No new tests were added because the existing functional tests
  provide good coverage of various metadata loading scenarios.
- The issue reported in IMPALA-5152 is basically impossible now.
  Adding FE unit tests for that bug specifically would require
  ugly changes to the new code to enable such testing.

Change-Id: I68d32d5acd4a6f6bc6cedb05e6cc5cf604d24a55
Reviewed-on: http://gerrit.cloudera.org:8080/8958
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> Frontend requests metadata for one table at a time in the query 
> 
>
> Key: IMPALA-5152
> URL: https://issues.apache.org/jira/browse/IMPALA-5152
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog, Frontend
>Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
>Reporter: Mostafa Mokhtar
>Assignee: Alexander Behm
>Priority: Critical
>  Labels: Performance, frontend
> Fix For: Impala 2.12.0
>
>
> It appears that the Frontend serializes loading metadata for missing tables 
> in a query; the Catalog log shows that the queue size is always 0. 
> The query below references 9 tables and metadata is loaded for one table at a 
> time. 
> {code}
> explain select i_item_id ,i_item_desc ,s_state ,count(ss_quantity) as 
> store_sales_quantitycount ,avg(ss_quantity) as store_sales_quantityave 
> ,stddev_samp(ss_quantity) as store_sales_quantitystdev 
> ,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov 
> ,count(sr_return_quantity) as store_returns_quantitycount 
> ,avg(sr_return_quantity) as store_returns_quantityave 
> ,stddev_samp(sr_return_quantity) as store_returns_quantitystdev 
> ,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as 
> store_returns_quantitycov ,count(cs_quantity) as catalog_sales_quantitycount 
> ,avg(cs_quantity) as catalog_sales_quantityave ,stddev_samp(cs_quantity) as 
> 

[jira] [Created] (IMPALA-6536) CREATE TABLE on S3 takes a very long time

2018-02-16 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6536:
--

 Summary: CREATE TABLE on S3 takes a very long time
 Key: IMPALA-6536
 URL: https://issues.apache.org/jira/browse/IMPALA-6536
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0
Reporter: Alexander Behm
Assignee: Alexander Behm


*Summary*
Creating a table that points to existing data in S3 can take an excessive 
amount of time.

*Reason*
If the Hive Metastore is configured with "hive.stats.autogather=true" then Hive 
lists the files of newly created tables to populate basic statistics like file 
count and file byte sizes. Unfortunately, this listing operation can take an 
excessive amount of time particularly on S3.

*Workarounds*
* Add TBLPROPERTIES("DO_NOT_UPDATE_STATS"="true") to your CREATE TABLE
* Or reconfigure the Hive Metastore with "hive.stats.autogather=false"

*Example*
{code}
CREATE EXTERNAL TABLE tpch_lineitem_s3 (
  l_orderkey BIGINT,
  l_partkey BIGINT,
  l_suppkey BIGINT,
  l_linenumber BIGINT,
  l_quantity DECIMAL(12,2),
  l_extendedprice DECIMAL(12,2),
  l_discount DECIMAL(12,2),
  l_tax DECIMAL(12,2),
  l_returnflag STRING,
  l_linestatus STRING,
  l_shipdate STRING,
  l_commitdate STRING,
  l_receiptdate STRING,
  l_shipinstruct STRING,
  l_shipmode STRING,
  l_comment STRING
)
STORED AS PARQUET
LOCATION "s3a://some_location/my_existing_data"
TBLPROPERTIES("DO_NOT_UPDATE_STATS"="true"); <--- Add this as a workaround
{code}

Impala should probably add the workaround automatically in CREATE TABLE since 
Impala does not even use those basic statistics populated by the Hive Metastore.





[jira] [Created] (IMPALA-6492) impalad ERROR log flooding in test runs

2018-02-07 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6492:
--

 Summary: impalad ERROR log flooding in test runs
 Key: IMPALA-6492
 URL: https://issues.apache.org/jira/browse/IMPALA-6492
 Project: IMPALA
  Issue Type: Bug
  Components: Backend, Infrastructure
Affects Versions: Impala 2.12.0
Reporter: Alexander Behm


I noticed that our ERROR log files are being flooded in our build+test runs. I 
don't know if that is expected, but I suspect it is not.

The impalad ERROR logs contain >10MB of presumably useless/unintended output 
like this:
{code}
FSDataOutputStream#close error:
RemoteException: File does not exist: 
/test-warehouse/functional.db/alltypesinsert/_impala_insert_staging/984ed8a9aef475a8_f4f944d5/.984ed8a9aef475a8-f4f944d50008_402940055_dir/year=2009/month=3444/984ed8a9aef475a8-f4f944d50008_917686563_data.0.
 (inode 56482) Holder DFSClient_NONMAPREDUCE_944983647_1 does not have any open 
files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2761)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:691)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:677)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2804)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:917)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: 
/test-warehouse/functional.db/alltypesinsert/_impala_insert_staging/984ed8a9aef475a8_f4f944d5/.984ed8a9aef475a8-f4f944d50008_402940055_dir/year=2009/month=3444/984ed8a9aef475a8-f4f944d50008_917686563_data.0.
 (inode 56482) Holder DFSClient_NONMAPREDUCE_944983647_1 does not have any open 
files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2761)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:691)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:677)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2804)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:917)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1437)
at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.complete(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:536)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown 

[jira] [Created] (IMPALA-6491) More robust HBase scan cardinality estimation

2018-02-07 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6491:
--

 Summary: More robust HBase scan cardinality estimation
 Key: IMPALA-6491
 URL: https://issues.apache.org/jira/browse/IMPALA-6491
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, 
Impala 2.7.0, Impala 2.6.0, Impala 2.5.0
Reporter: Alexander Behm


There are a few issues with our HBase scan cardinality estimation:
1. The cardinality estimates can be very inaccurate, leading to bad plan 
choices. In particular, users have reported cases of severe underestimation, 
which can have a ripple effect in the query plan (e.g., the planner thinks a 
join with that table is selective).
2. Unlike HDFS scans, we do not use row count statistics from the Hive 
Metastore for estimating the cardinality of HBase scans. Instead, we do a small 
scan over the HBase table and estimate a row count based on the average bytes 
per row and the storefile size.

There are other more detailed caveats with the HBase estimation method.

The original motivation of this method was to adjust the row count for queries 
that only scan a subset of the region servers (the HMS statistics only cover 
the entire table).

*Proposal*
To address these shortcomings, we could start with the table-level row count 
stored in the Metastore and then adjust that number based on the total number of 
bytes in the table and the number of bytes in the relevant region servers.
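A minimal sketch of that proposal, with a hypothetical function name and fall-back behavior for illustration (not Impala's actual planner code): scale the Metastore row count by the fraction of the table's bytes the scan covers.

```python
def estimate_hbase_scan_cardinality(hms_row_count, table_bytes, scanned_bytes):
    """Scale the table-level HMS row count by the fraction of bytes
    the scan touches in the relevant region servers (sketch only)."""
    if hms_row_count is None or table_bytes <= 0:
        return None  # no stats: caller falls back to the sampling-based estimate
    fraction = min(1.0, scanned_bytes / table_bytes)
    return max(1, round(hms_row_count * fraction))
```

For a scan covering a quarter of the table's bytes, a 1M-row table would be estimated at 250K rows; a scan covering all region servers degenerates to the plain HMS row count.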



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IMPALA-6489) ASAN use-after-poison in impala::HdfsScanner::InitTupleFromTemplate

2018-02-07 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6489:
--

 Summary: ASAN use-after-poison in 
impala::HdfsScanner::InitTupleFromTemplate
 Key: IMPALA-6489
 URL: https://issues.apache.org/jira/browse/IMPALA-6489
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 2.12.0
Reporter: Alexander Behm
Assignee: Tim Armstrong


Tim, can you take a look? Feel free to re-assign.

Relevant dump from impalad.ERROR:
{code}
==19705==ERROR: AddressSanitizer: use-after-poison on address 0x621000b60905 at 
pc 0x01374875 bp 0x7f9c3366f000 sp 0x7f9c3366e7b0
READ of size 17 at 0x621000b60905 thread T76302
E0207 01:08:01.352087  4379 LiteralExpr.java:186] Failed to evaluate expr 
'85070591730234615865843651857942052864 - 
58.4864500206578527026870244535800'
E0207 01:08:02.044962  4379 LiteralExpr.java:186] Failed to evaluate expr 
'85070591730234615865843651857942052864 - 
58.4864500206578527026870244535800'
#0 0x1374874 in __asan_memcpy 
/data/jenkins/workspace/impala-toolchain-package-build/label/ec2-package-centos-6/toolchain/source/llvm/llvm-3.9.1.src/projects/compiler-rt/lib/asan/asan_interceptors.cc:413
#1 0x1c2c111 in impala::HdfsScanner::InitTupleFromTemplate(impala::Tuple*, 
impala::Tuple*, int) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/hdfs-scanner.h:445:5
#2 0x1c9a97d in 
impala::HdfsParquetScanner::AssembleCollection(std::vector > const&, int, 
impala::CollectionValueBuilder*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/hdfs-parquet-scanner.cc:1303:7
#3 0x1d34752 in impala::CollectionColumnReader::ReadSlot(impala::Tuple*, 
impala::MemPool*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/parquet-column-readers.cc:1281:38
#4 0x1d31b5f in impala::CollectionColumnReader::ReadValue(impala::MemPool*, 
impala::Tuple*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/parquet-column-readers.cc:1258:12
#5 0x1d2a48e in 
impala::ParquetColumnReader::ReadValueBatch(impala::MemPool*, int, int, 
unsigned char*, int*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/parquet-column-readers.cc:804:26
#6 0x1c93de2 in 
impala::HdfsParquetScanner::AssembleRows(std::vector > const&, impala::RowBatch*, 
bool*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/hdfs-parquet-scanner.cc:1034:42
#7 0x1c8fd58 in 
impala::HdfsParquetScanner::GetNextInternal(impala::RowBatch*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/hdfs-parquet-scanner.cc:507:19
#8 0x1c8df5f in impala::HdfsParquetScanner::ProcessSplit() 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/hdfs-parquet-scanner.cc:405:21
#9 0x1be6f0e in 
impala::HdfsScanNode::ProcessSplit(std::vector const&, impala::MemPool*, 
impala::io::ScanRange*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/hdfs-scan-node.cc:532:21
#10 0x1be60c9 in impala::HdfsScanNode::ScannerThread() 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/exec/hdfs-scan-node.cc:442:16
#11 0x16a19c2 in boost::function0::operator()() const 
/data/jenkins/workspace/impala-asf-master-core-asan/Impala-Toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:766:14
#12 0x1af97c3 in impala::Thread::SuperviseThread(std::string const&, 
std::string const&, boost::function, impala::Promise*) 
/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/src/util/thread.cc:354:3
#13 0x1b04685 in void boost::_bi::list4, 
boost::_bi::value >::operator(), impala::Promise*), 
boost::_bi::list0>(boost::_bi::type, void (*&)(std::string const&, 
std::string const&, boost::function, impala::Promise*), 
boost::_bi::list0&, int) 
/data/jenkins/workspace/impala-asf-master-core-asan/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind.hpp:457:9
#14 0x1b04501 in boost::_bi::bind_t, 
boost::_bi::value > >::operator()() 
/data/jenkins/workspace/impala-asf-master-core-asan/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind_template.hpp:20:16
#15 0x2fc7c49 in thread_proxy 
(/data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/be/build/debug/service/impalad+0x2fc7c49)
#16 0x37c3807850 in start_thread 

[jira] [Resolved] (IMPALA-6485) BE compilation failure: error: ‘EVP_CTRL_GCM_SET_IVLEN’ was not declared in this scope

2018-02-06 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6485.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit 4ca7f261e437c2eb49d8d114e9af3696c61428f4
Author: Alex Behm 
Date:   Tue Feb 6 11:50:51 2018 -0800

Revert "IMPALA-6219: Use AES-GCM for spill-to-disk encryption"

This reverts commit 9b68645f9eb9e08899fda860e0946cc05f205479.

Change-Id: Ia06f061a4ecedd1df0d359fe06fe84618b5e07fb


> BE compilation failure: error: ‘EVP_CTRL_GCM_SET_IVLEN’ was not declared in 
> this scope
> --
>
> Key: IMPALA-6485
> URL: https://issues.apache.org/jira/browse/IMPALA-6485
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Affects Versions: Impala 2.12.0
>Reporter: Alexander Behm
>Assignee: Tim Armstrong
>Priority: Blocker
>  Labels: broken-build
> Fix For: Impala 2.12.0
>
>
> Failure:
> {code}
> 20:39:32 
> /data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:
>  In member function ‘impala::Status 
> impala::EncryptionKey::EncryptInternal(bool, const uint8_t*, int64_t, 
> uint8_t*)’:
> 20:39:32 
> /data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:134:31:
>  error: ‘EVP_CTRL_GCM_SET_IVLEN’ was not declared in this scope
> 20:39:32  EVP_CIPHER_CTX_ctrl(, EVP_CTRL_GCM_SET_IVLEN, 
> AES_BLOCK_SIZE, NULL);
> 20:39:32^
> 20:39:32 
> /data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:169:31:
>  error: ‘EVP_CTRL_GCM_SET_TAG’ was not declared in this scope
> 20:39:32  EVP_CIPHER_CTX_ctrl(, EVP_CTRL_GCM_SET_TAG, AES_BLOCK_SIZE, 
> gcm_tag_);
> 20:39:32^
> 20:39:32 
> /data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:181:31:
>  error: ‘EVP_CTRL_GCM_GET_TAG’ was not declared in this scope
> 20:39:32  EVP_CIPHER_CTX_ctrl(, EVP_CTRL_GCM_GET_TAG, AES_BLOCK_SIZE, 
> gcm_tag_);
> 20:39:32^
> 20:39:32 make[2]: *** [be/src/util/CMakeFiles/Util.dir/openssl-util.cc.o] 
> Error 1
> 20:39:32 make[2]: *** Waiting for unfinished jobs
> {code}





[jira] [Resolved] (IMPALA-5037) Change default Parquet array resolution according to Parquet standard.

2018-02-06 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-5037.

   Resolution: Fixed
Fix Version/s: Impala 3.0

commit dedff5e25c19da724ae032a52354cd4fd61ec40a
Author: Alex Behm 
Date:   Thu Feb 1 16:57:53 2018 -0800

IMPALA-5037: Default PARQUET_ARRAY_RESOLUTION=THREE_LEVEL

Changes the default value for the PARQUET_ARRAY_RESOLUTION
query option to conform to the Parquet standard.
Before: TWO_LEVEL_THEN_THREE_LEVEL
After:  THREE_LEVEL

For more information see:
* IMPALA-4725
* https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

Testing:
- expands and cleans up the existing tests for more coverage
  over the different resolution policies
- private core/hdfs run passed

Cherry-picks: not for 2.x.

Change-Id: Ib8f7e9010c4354d667305d9df7b78862efb23fe1
Reviewed-on: http://gerrit.cloudera.org:8080/9210
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> Change default Parquet array resolution according to Parquet standard.
> --
>
> Key: IMPALA-5037
> URL: https://issues.apache.org/jira/browse/IMPALA-5037
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.8.0, Impala 2.9.0
>Reporter: Alexander Behm
>Assignee: Alexander Behm
>Priority: Major
>  Labels: include-in-v3, incompatibility
> Fix For: Impala 3.0
>
>
> With IMPALA-4725 we've introduced query options to control the field 
> resolution behavior when scanning Parquet files with nested arrays. The 
> current default strategy tries to auto-detect the array encoding 
> within Parquet files, but this strategy can sometimes subtly go wrong and 
> return incorrect results due to the inherent ambiguity of the 2/3-level 
> encoding schemes in Parquet.
> We should switch the default resolution strategy to the Parquet 
> standard 3-level encoding, instead of the current auto-detection.





[jira] [Created] (IMPALA-6485) BE compilation failure: error: ‘EVP_CTRL_GCM_SET_IVLEN’ was not declared in this scope

2018-02-06 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6485:
--

 Summary: BE compilation failure: error: ‘EVP_CTRL_GCM_SET_IVLEN’ 
was not declared in this scope
 Key: IMPALA-6485
 URL: https://issues.apache.org/jira/browse/IMPALA-6485
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 2.12.0
Reporter: Alexander Behm
Assignee: Tim Armstrong


Failure:
{code}
20:39:32 
/data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:
 In member function ‘impala::Status 
impala::EncryptionKey::EncryptInternal(bool, const uint8_t*, int64_t, 
uint8_t*)’:
20:39:32 
/data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:134:31:
 error: ‘EVP_CTRL_GCM_SET_IVLEN’ was not declared in this scope
20:39:32  EVP_CIPHER_CTX_ctrl(, EVP_CTRL_GCM_SET_IVLEN, AES_BLOCK_SIZE, 
NULL);
20:39:32^
20:39:32 
/data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:169:31:
 error: ‘EVP_CTRL_GCM_SET_TAG’ was not declared in this scope
20:39:32  EVP_CIPHER_CTX_ctrl(, EVP_CTRL_GCM_SET_TAG, AES_BLOCK_SIZE, 
gcm_tag_);
20:39:32^
20:39:32 
/data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/openssl-util.cc:181:31:
 error: ‘EVP_CTRL_GCM_GET_TAG’ was not declared in this scope
20:39:32  EVP_CIPHER_CTX_ctrl(, EVP_CTRL_GCM_GET_TAG, AES_BLOCK_SIZE, 
gcm_tag_);
20:39:32^
20:39:32 make[2]: *** [be/src/util/CMakeFiles/Util.dir/openssl-util.cc.o] Error 
1
20:39:32 make[2]: *** Waiting for unfinished jobs
{code}






[jira] [Created] (IMPALA-6484) Crash in impala::RuntimeProfile::SortChildren

2018-02-06 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6484:
--

 Summary: Crash in impala::RuntimeProfile::SortChildren
 Key: IMPALA-6484
 URL: https://issues.apache.org/jira/browse/IMPALA-6484
 Project: IMPALA
  Issue Type: Bug
  Components: Backend
Affects Versions: Impala 2.12.0
Reporter: Alexander Behm
Assignee: Lars Volker


Lars, assigning to you since you touched the relevant code last.

Tests running at the time (as far as I can tell):
{code}
query_test/test_spilling.py
query_test/test_insert_parquet.py
TestScannersFuzzing.test_fuzz_decimal_tbl
{code}

Backtrace:
{code}
#0  0x0030120328e5 in raise () from /lib64/libc.so.6
#1  0x0030120340c5 in abort () from /lib64/libc.so.6
#2  0x7f49030c81a5 in os::abort(bool) () from 
/opt/toolchain/sun-jdk-64bit-1.8.0.05/jre/lib/amd64/server/libjvm.so
#3  0x7f4903258843 in VMError::report_and_die() () from 
/opt/toolchain/sun-jdk-64bit-1.8.0.05/jre/lib/amd64/server/libjvm.so
#4  0x7f49030cd562 in JVM_handle_linux_signal () from 
/opt/toolchain/sun-jdk-64bit-1.8.0.05/jre/lib/amd64/server/libjvm.so
#5  0x7f49030c44f3 in signalHandler(int, siginfo*, void*) () from 
/opt/toolchain/sun-jdk-64bit-1.8.0.05/jre/lib/amd64/server/libjvm.so
#6  
#7  0x0182355a in std::_Rb_tree, 
std::pair const, impala::RuntimeProfile::Counter*>, 
std::_Select1st const, impala::RuntimeProfile::Counter*> >, 
std::less >, std::allocator const, impala::RuntimeProfile::Counter*> > >::_M_begin 
(this=0x382e657461745393) at 
/data/jenkins/workspace/impala-asf-master-core/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_tree.h:518
#8  0x01821eb2 in std::_Rb_tree, 
std::pair const, impala::RuntimeProfile::Counter*>, 
std::_Select1st const, impala::RuntimeProfile::Counter*> >, 
std::less >, std::allocator const, impala::RuntimeProfile::Counter*> > 
>::lower_bound (this=0x382e657461745393, __k=...) at 
/data/jenkins/workspace/impala-asf-master-core/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_tree.h:927
#9  0x01820bc9 in std::map, 
impala::RuntimeProfile::Counter*, std::less >, 
std::allocator const, impala::RuntimeProfile::Counter*> > 
>::lower_bound (this=0x382e657461745393, __x=...) at 
/data/jenkins/workspace/impala-asf-master-core/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_map.h:902
#10 0x0181f8b6 in std::map, 
impala::RuntimeProfile::Counter*, std::less >, 
std::allocator const, impala::RuntimeProfile::Counter*> > >::operator[] 
(this=0x382e657461745393, __k=...) at 
/data/jenkins/workspace/impala-asf-master-core/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_map.h:496
#11 0x0184e526 in impala::RuntimeProfile::total_time_counter 
(this=0x382e657461745373) at 
/data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/util/runtime-profile.h:261
#12 0x02cb656b in operator() (this=0x7f483784a380, a=..., b=...) at 
/data/jenkins/workspace/impala-asf-master-core/repos/Impala/be/src/runtime/coordinator-backend-state.cc:554
#13 0x02cbe794 in 
__gnu_cxx::__ops::_Val_comp_iter::operator(), __gnu_cxx::__normal_iterator*, 
std::vector > > > 
(this=0x7f483784a380, __val=..., __it=...) at 
/data/jenkins/workspace/impala-asf-master-core/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/predefined_ops.h:166
#14 0x02cbdcc0 in 
std::__unguarded_linear_insert<__gnu_cxx::__normal_iterator*, std::vector > >, 
__gnu_cxx::__ops::_Val_comp_iter > (__last=..., __comp=...) 
at 
/data/jenkins/workspace/impala-asf-master-core/Impala-Toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_algo.h:1827
#15 0x02cbd01b in 

[jira] [Resolved] (IMPALA-6024) Add minimum sample size for COMPUTE STATS TABLESAMPLE

2018-01-31 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6024.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit 22d9ac08937f348b21075b276d487f4b1ba3524c
Author: Alex Behm 
Date:   Mon Jan 22 23:07:25 2018 -0800

IMPALA-6024: Min sample bytes for COMPUTE STATS TABLESAMPLE

Adds a new query option COMPUTE_STATS_MIN_SAMPLE_SIZE
which is the minimum number of bytes that will be scanned
in COMPUTE STATS TABLESAMPLE, regardless of the user-supplied
sampling percent.

The motivation is to prevent sampling for very small tables
where accurate stats can be obtained cheaply without sampling.

This patch changes COMPUTE STATS TABLESAMPLE to run the regular
COMPUTE STATS if the effective sampling percent is 0% or 100%.
For a 100% sampling rate, the sampling-based stats queries
are more expensive and produce less accurate stats than the
regular COMPUTE STATS.

Default: COMPUTE_STATS_MIN_SAMPLE_SIZE=1GB

Testing:
- added new unit tests and ran them locally

Change-Id: I2cb91a40bec50b599875109c2f7c5bf6f41c2400
Reviewed-on: http://gerrit.cloudera.org:8080/9113
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> Add minimum sample size for COMPUTE STATS TABLESAMPLE
> -
>
> Key: IMPALA-6024
> URL: https://issues.apache.org/jira/browse/IMPALA-6024
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Frontend
>Affects Versions: Impala 2.10.0, Impala 2.11.0
>Reporter: Alexander Behm
>Assignee: Alexander Behm
>Priority: Major
> Fix For: Impala 2.12.0
>
>
> We should introduce a minimum sample size in bytes for COMPUTE STATS 
> TABLESAMPLE. Reasons:
> * For small tables sampling does not make sense. Accurate stats can be 
> obtained cheaply without sampling.
> * Very small sample sizes mostly do not make sense - some minimum of data is 
> required to get meaningful stats. 
> I think a 1GB minimum might be a good choice and ideally this minimum sample 
> size would be configurable.
> Many other DBMS have stats collection with sampling and in many cases a 
> minimum sample size is required to get any meaningful stats.





[jira] [Resolved] (IMPALA-6438) Document IMPALA-5191: Incompatible change to alias and ordinal substitutions

2018-01-24 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6438.

Resolution: Duplicate

Duplicates IMPALA-6415

> Document IMPALA-5191: Incompatible change to alias and ordinal substitutions
> 
>
> Key: IMPALA-6438
> URL: https://issues.apache.org/jira/browse/IMPALA-6438
> Project: IMPALA
>  Issue Type: Documentation
>  Components: Docs
>Affects Versions: Impala 2.11.0
>Reporter: Alexander Behm
>Assignee: Zoltán Borók-Nagy
>Priority: Blocker
>  Labels: incompatibility
>
> We should carefully document the before and after behavior of IMPALA-5191 and 
> be sure to call this out as an incompatible change.





[jira] [Created] (IMPALA-6438) Document IMPALA-5191: Incompatible change to alias and ordinal substitutions

2018-01-24 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6438:
--

 Summary: Document IMPALA-5191: Incompatible change to alias and 
ordinal substitutions
 Key: IMPALA-6438
 URL: https://issues.apache.org/jira/browse/IMPALA-6438
 Project: IMPALA
  Issue Type: Documentation
  Components: Docs
Affects Versions: Impala 2.11.0
Reporter: Alexander Behm
Assignee: Zoltán Borók-Nagy


We should carefully document the before and after behavior of IMPALA-5191 and 
be sure to call this out as an incompatible change.





[jira] [Resolved] (IMPALA-6422) Compute stats tablesample spends a lot of time in powf()

2018-01-19 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6422.

   Resolution: Fixed
Fix Version/s: Impala 2.1.2

commit 1dfdc6704b74c77d63accb69e9197fd203455be0
Author: Alex Behm 
Date:   Thu Jan 18 19:06:30 2018 -0800

IMPALA-6422: Use ldexp() instead of powf() in HLL.

Using ldexp() to compute a floating point power of two is
over 10x faster than powf().

This change is particularly helpful for speeding up
COMPUTE STATS TABLESAMPLE which has many calls to
HllFinalEstimate() where floating point power of two
computations are relevant.

Testing:
- core/hdfs run passed

Change-Id: I517614d3f9cf1cf56b15a173c3cfe76e0f2e0382
Reviewed-on: http://gerrit.cloudera.org:8080/9078
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> Compute stats tablesample spends a lot of time in powf()
> 
>
> Key: IMPALA-6422
> URL: https://issues.apache.org/jira/browse/IMPALA-6422
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.11.0
>Reporter: Alexander Behm
>Assignee: Alexander Behm
>Priority: Major
>  Labels: compute-stats, perfomance
> Fix For: Impala 2.1.2
>
>
> [~mmokhtar] did perf profiling for COMPUTE STATS TABLESAMPLE and discovered 
> that a lot of time is spent on finalizing HLL intermediates. Most time is 
> spent in powf().
> Relevant snippet from AggregateFunctions::HllFinalEstimate() in 
> aggregate-functions-ir.cc:
> {code}
>   for (int i = 0; i < num_buckets; ++i) {
>     harmonic_mean += powf(2.0f, -buckets[i]);
>     if (buckets[i] == 0) ++num_zero_registers;
>   }
> {code}
> Since we're computing a power of 2, using ldexp() should be much more efficient.
> I did a microbenchmark and found that ldexp() is >10x faster than powf() for 
> this scenario.





[jira] [Created] (IMPALA-6422) Compute stats tablesample spends a lot of time in powf()

2018-01-18 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6422:
--

 Summary: Compute stats tablesample spends a lot of time in powf()
 Key: IMPALA-6422
 URL: https://issues.apache.org/jira/browse/IMPALA-6422
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Affects Versions: Impala 2.11.0
Reporter: Alexander Behm
Assignee: Alexander Behm


[~mmokhtar] did perf profiling for COMPUTE STATS TABLESAMPLE and discovered 
that a lot of time is spent on finalizing HLL intermediates. Most time is spent 
in powf().

Relevant snippet from AggregateFunctions::HllFinalEstimate() in 
aggregate-functions-ir.cc:
{code}
  for (int i = 0; i < num_buckets; ++i) {
    harmonic_mean += powf(2.0f, -buckets[i]);
    if (buckets[i] == 0) ++num_zero_registers;
  }
{code}

Since we're computing a power of 2, using ldexp() should be much more efficient.

I did a microbenchmark and found that ldexp() is >10x faster than powf() for 
this scenario.
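The equivalence behind the proposed fix can be sketched in Python (mirroring the C++ loop; ldexp builds 2**-b directly from the float's exponent field, which is what makes it so much cheaper than a general-purpose pow):

```python
import math

# math.ldexp(x, n) computes x * 2**n by adjusting the float's exponent,
# so ldexp(1.0, -b) is an exact, cheap replacement for powf(2.0f, -b).
for b in range(64):
    assert math.ldexp(1.0, -b) == 2.0 ** -b

# The HllFinalEstimate() accumulation, as in the snippet above:
buckets = [0, 3, 5, 0, 12]
harmonic_mean = 0.0
num_zero_registers = 0
for b in buckets:
    harmonic_mean += math.ldexp(1.0, -b)
    if b == 0:
        num_zero_registers += 1
```

Both forms produce bit-identical sums; only the cost per call differs.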






[jira] [Resolved] (IMPALA-6402) PlannerTest.testFkPkJoinDetection failure due to missing partition

2018-01-17 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6402.

Resolution: Cannot Reproduce

> PlannerTest.testFkPkJoinDetection failure due to missing partition
> --
>
> Key: IMPALA-6402
> URL: https://issues.apache.org/jira/browse/IMPALA-6402
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 2.11.0
> Environment: PlannerTest.testFkPkJoinDetection
>Reporter: Sailesh Mukil
>Assignee: Alexander Behm
>Priority: Blocker
>  Labels: broken-build
>
>  
> PlannerTest.testFkPkJoinDetection failed a recent test run. From the output 
> below, the test expects 1824 partitions and files to be present; however, one 
> partition is missing. This may be due to filesystem flakiness, but we're not 
> certain yet.
>  
> {code:java}
> ---
> Test set: org.apache.impala.planner.PlannerTest
> ---
> Tests run: 64, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 62.354 sec 
> <<< FAILURE! - in org.apache.impala.planner.PlannerTest
> testFkPkJoinDetection(org.apache.impala.planner.PlannerTest) Time elapsed: 
> 4.821 sec <<< FAILURE!
> java.lang.AssertionError: 
> Section PLAN of query:
> select /* +straight_join */ 1 from
> tpcds_seq_snap.store_sales inner join tpcds.customer
> on ss_customer_sk = c_customer_sk
> Actual does not match expected result:
> F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> | Per-Host Resources: mem-estimate=177.94MB mem-reservation=1.94MB
> PLAN-ROOT SINK
> | mem-estimate=0B mem-reservation=0B
> |
> 02:HASH JOIN [INNER JOIN]
> | hash predicates: ss_customer_sk = c_customer_sk
> | fk/pk conjuncts: assumed fk/pk
> | runtime filters: RF000[bloom] <- c_customer_sk
> | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
> | tuple-ids=0,1 row-size=8B cardinality=unavailable
> |
> |--01:SCAN HDFS [tpcds.customer]
> | partitions=1/1 files=1 size=12.60MB
> | stored statistics:
> | table: rows=10 size=12.60MB
> | columns: all
> | extrapolated-rows=disabled
> | mem-estimate=48.00MB mem-reservation=0B
> | tuple-ids=1 row-size=4B cardinality=10
> |
> 00:SCAN HDFS [tpcds_seq_snap.store_sales]
>  partitions=1823/1823 files=1823 size=207.90MB
> 
>  runtime filters: RF000[bloom] -> ss_customer_sk
>  stored statistics:
>  table: rows=unavailable size=unavailable
>  partitions: 0/1823 rows=unavailable
>  columns: unavailable
>  extrapolated-rows=disabled
>  mem-estimate=128.00MB mem-reservation=0B
>  tuple-ids=0 row-size=4B cardinality=unavailable
> Expected:
> F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> | Per-Host Resources: mem-estimate=177.94MB mem-reservation=1.94MB
> PLAN-ROOT SINK
> | mem-estimate=0B mem-reservation=0B
> |
> 02:HASH JOIN [INNER JOIN]
> | hash predicates: ss_customer_sk = c_customer_sk
> | fk/pk conjuncts: assumed fk/pk
> | runtime filters: RF000[bloom] <- c_customer_sk
> | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
> | tuple-ids=0,1 row-size=8B cardinality=unavailable
> |
> |--01:SCAN HDFS [tpcds.customer]
> | partitions=1/1 files=1 size=12.60MB
> | stored statistics:
> | table: rows=10 size=12.60MB
> | columns: all
> | extrapolated-rows=disabled
> | mem-estimate=48.00MB mem-reservation=0B
> | tuple-ids=1 row-size=4B cardinality=10
> |
> 00:SCAN HDFS [tpcds_seq_snap.store_sales]
>  partitions=1824/1824 files=1824 size=207.90MB
>  runtime filters: RF000[bloom] -> ss_customer_sk
>  stored statistics:
>  table: rows=unavailable size=unavailable
>  partitions: 0/1824 rows=unavailable
>  columns: unavailable
>  extrapolated-rows=disabled
>  mem-estimate=128.00MB mem-reservation=0B
>  tuple-ids=0 row-size=4B cardinality=unavailable
> Verbose plan:
> F00:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> Per-Host Resources: mem-estimate=177.94MB mem-reservation=1.94MB
>  PLAN-ROOT SINK
>  | mem-estimate=0B mem-reservation=0B
>  |
>  02:HASH JOIN [INNER JOIN]
>  | hash predicates: ss_customer_sk = c_customer_sk
>  | fk/pk conjuncts: assumed fk/pk
>  | runtime filters: RF000[bloom] <- c_customer_sk
>  | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
>  | tuple-ids=0,1 row-size=8B cardinality=unavailable
>  |
>  |--01:SCAN HDFS [tpcds.customer]
>  | partitions=1/1 files=1 size=12.60MB
>  | stored statistics:
>  | table: rows=10 size=12.60MB
>  | columns: all
>  | extrapolated-rows=disabled
>  | mem-estimate=48.00MB mem-reservation=0B
>  | tuple-ids=1 row-size=4B 

[jira] [Created] (IMPALA-6397) IllegalStateException in planning of aggregation with float and decimal literal child expressions

2018-01-13 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6397:
--

 Summary: IllegalStateException in planning of aggregation with 
float and decimal literal child expressions
 Key: IMPALA-6397
 URL: https://issues.apache.org/jira/browse/IMPALA-6397
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0
Reporter: Alexander Behm


Reproduction:
{code}
select sum(float_col + d) from (select float_col, 1.2 d from 
functional.alltypes) v;
ERROR: IllegalStateException: Agg expr sum(float_col + 1.2) returns type DOUBLE 
but its output tuple slot has type DECIMAL(38,9)
{code}

FE Stack:
{code}
I0113 14:44:36.300395  9285 jni-util.cc:211] java.lang.IllegalStateException: 
Agg expr sum(f + 1.2) returns type DOUBLE but its output tuple slot has type 
DECIMAL(38,9)
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:145)
at 
org.apache.impala.analysis.AggregateInfo.checkConsistency(AggregateInfo.java:702)
at 
org.apache.impala.planner.AggregationNode.init(AggregationNode.java:165)
at 
org.apache.impala.planner.SingleNodePlanner.createAggregationPlan(SingleNodePlanner.java:895)
at 
org.apache.impala.planner.SingleNodePlanner.createSelectPlan(SingleNodePlanner.java:621)
at 
org.apache.impala.planner.SingleNodePlanner.createQueryPlan(SingleNodePlanner.java:257)
at 
org.apache.impala.planner.SingleNodePlanner.createSingleNodePlan(SingleNodePlanner.java:147)
at org.apache.impala.planner.Planner.createPlan(Planner.java:101)
at 
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1044)
at 
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1147)
at 
org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:156)
{code}

This bug does not happen with DECIMAL_V2=true. It is specific to the implicit 
casting behavior of DECIMAL_V1 with decimal literals.

Note that the following equivalent query without the inline view works fine:
{code}
select sum(float_col + 1.2) from functional.alltypes;
{code}

Also note that this bug only happens in combination with a decimal literal. The 
following query also works fine:
{code}
create table t (f float, d decimal (2,1));
select sum(f + d) from (select f, d from t) v;
-- works fine
{code}





[jira] [Resolved] (IMPALA-6329) Wrong results for complex query with CTE, limit, group by and left join

2018-01-12 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6329.

Resolution: Not A Bug

The query contains a LIMIT without an ORDER BY and so produces non-deterministic 
results.

> Wrong results for complex query with CTE, limit, group by and left join
> ---
>
> Key: IMPALA-6329
> URL: https://issues.apache.org/jira/browse/IMPALA-6329
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.10.0
>Reporter: Alex
>Assignee: Alexander Behm
>Priority: Blocker
>  Labels: correctness
>
> Impala may generate an incorrect plan for a complex query
> (see the NULLs in the id and a_id columns below).
> A correct result can be obtained by commenting out the marked lines (1, 2, 3, 4, 5).
> Example query and incorrect plan:
> {code}
> with test as (
>   select id, b as b from(
>select 1 as id , 10 as b union all
>select 2 as id , 20 as b union all
>select 3 as id , 30 as b union all
>select 4 as id , 40 as b union all
>select 5 as id , 50 as b
>   ) t 
>   group by id, b --1
>   limit 3 --2
> ), 
> test2 as (
>   select 1 as id, 10 as a_id union all
>   select 2, 10 union all
>   select 3, 20 union all
>   select 4, 20 union all
>   select 5, 30 union all
>   select 6, 40
> )
> select * from test 
> left --3
> join
> (select id , a_id
>from (select id, a_id
>from test2
>   where id in (select id from test) --4
>   group by id, a_id) t
>   group by id, a_id --5
> ) e on test.id = e.id;
> {code}
> Result:
> {code}
> +----+----+------+------+
> | id | b  | id   | a_id |
> +----+----+------+------+
> | 2  | 20 | NULL | NULL |
> | 3  | 30 | 3    | 20   |
> | 5  | 50 | 5    | 30   |
> +----+----+------+------+
> {code}
> Plan:
> {code}
> +--+
> | Explain String   |
> +--+
> | Max Per-Host Resource Reservation: Memory=9.69MB |
> | Per-Host Resource Estimates: Memory=41.94MB  |
> | Codegen disabled by planner  |
> |  |
> | PLAN-ROOT SINK   |
> | ||
> | 08:HASH JOIN [LEFT OUTER JOIN, BROADCAST]|
> | |  hash predicates: id = id  |
> | ||
> | |--10:EXCHANGE [UNPARTITIONED]   |
> | |  | |
> | |  07:AGGREGATE [FINALIZE]   |
> | |  |  group by: id, a_id |
> | |  | |
> | |  06:AGGREGATE [FINALIZE]   |
> | |  |  group by: id, a_id |
> | |  | |
> | |  05:HASH JOIN [LEFT SEMI JOIN, BROADCAST]  |
> | |  |  hash predicates: id = id   |
> | |  | |
> | |  |--09:EXCHANGE [UNPARTITIONED]|
> | |  |  |  |
> | |  |  04:AGGREGATE [FINALIZE]|
> | |  |  |  group by: id, b |
> | |  |  |  limit: 3|
> | |  |  |  |
> | |  |  03:UNION   |
> | |  | constant-operands=5 |
> | |  | |
> | |  02:UNION  |
> | | constant-operands=6|
> | ||
> | 01:AGGREGATE [FINALIZE]  |
> | |  group by: id, b   |
> | |  limit: 3  |
> | ||
> | 00:UNION |
> |constant-operands=5   |
> +--+
> {code}
> Correct result:
> {code}
> with test as (
>   select id, b as b from(
>select 1 as id , 10 as b union all
>select 2 as id , 20 as b union all
>select 3 as id , 30 as b union all
>select 4 as id , 40 as b union all
>select 5 as id , 50 as b
>   ) t 
>   --group by id, b --1
>   limit 3 --2
> ), 
> test2 as (
>   select 1 as id, 10 as a_id union all
>   select 2, 10 union all
>   select 3, 20 union all
>   select 4, 20 union all
>   select 5, 30 union all
>   select 6, 40
> )
> select * from test left join
> (select id , a_id
>from (select id, a_id
>from test2
>   where id in (select id from test) --3
>   group by id, 

[jira] [Resolved] (IMPALA-5310) Implement TABLESAMPLE for COMPUTE STATS

2017-12-15 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-5310.

   Resolution: Fixed
Fix Version/s: Impala 2.12.0

commit 1f7b3b00e921d68857abb22f48656eec194c
Author: Alex Behm 
Date:   Wed Dec 13 12:34:00 2017 -0800

IMPALA-5310: Part 3: Use SAMPLED_NDV() in COMPUTE STATS.

Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV()
function.

Testing:
- modified/improved existing functional tests
- core/hdfs run passed

Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b
Reviewed-on: http://gerrit.cloudera.org:8080/8840
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> Implement TABLESAMPLE for COMPUTE STATS
> ---
>
> Key: IMPALA-5310
> URL: https://issues.apache.org/jira/browse/IMPALA-5310
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Frontend
>Reporter: Alexander Behm
>Assignee: Alexander Behm
> Fix For: Impala 2.12.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (IMPALA-5099) support percentile analytical functions

2017-12-13 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-5099.

Resolution: Duplicate

> support percentile analytical functions
> ---
>
> Key: IMPALA-5099
> URL: https://issues.apache.org/jira/browse/IMPALA-5099
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 2.6.0
>Reporter: N Campbell
>
> While many database systems support percentile_cont/disc, Impala does not. 
> percentile_cont(0.5) could also be used to simulate median, which Impala does 
> not support.
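As a stopgap, Impala's built-in APPX_MEDIAN() aggregate gives an approximate median (a sketch; the table and column names are hypothetical, and the result is approximate rather than a true percentile_cont(0.5)):

{code}
select appx_median(amount) from sales;
{code}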





[jira] [Resolved] (IMPALA-5881) Use native allocations to workaround JVM limitations in getAllCatalogObjects()

2017-12-12 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-5881.

Resolution: Won't Fix

We decided to fix this in a different way. See IMPALA-5990.

> Use native allocations to workaround JVM limitations in getAllCatalogObjects()
> --
>
> Key: IMPALA-5881
> URL: https://issues.apache.org/jira/browse/IMPALA-5881
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: bharath v
>Assignee: bharath v
>
> When {{compact_catalog_topic=true}}, we need to compact the output of 
> getAllCatalogObjects() since it can potentially be quite big (on large 
> clusters) and we risk running into JVM array limitations if we use the 
> regular TBinaryProtocol. Currently we only compact the output between 
> backend <-> statestore, but it makes sense to do it between backend <-> 
> frontend in the JNI too.
> Additionally, figure out ways to use native allocations to work around JVM 
> array size limitations during thrift serialization.





[jira] [Resolved] (IMPALA-5934) Impala Thrift Server

2017-12-12 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-5934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-5934.

Resolution: Incomplete

> Impala Thrift Server 
> -
>
> Key: IMPALA-5934
> URL: https://issues.apache.org/jira/browse/IMPALA-5934
> Project: IMPALA
>  Issue Type: Bug
> Environment: Cloudera 5.10.1
>Reporter: Venkat Atmuri
> Fix For: Impala 2.7.0
>
>
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> E0914 10:58:10.457620 94112 logging.cc:121] stderr will be logged to this 
> file.
> W0914 10:58:10.467237 94112 authentication.cc:1003] LDAP authentication is 
> being used with TLS, but without an --ldap_ca_certificate file, the identity 
> of the LDAP server cannot be verified.  Network communication (and hence 
> passwords) could be intercepted by a man-in-the-middle attack
> E0914 10:58:13.220167 94268 thrift-server.cc:182] ThriftServer 'backend' (on 
> port: 22000) exited due to TException: Could not bind: Transport endpoint is 
> not connected
> E0914 10:58:13.220221 94112 thrift-server.cc:171] ThriftServer 'backend' (on 
> port: 22000) did not start correctly
> F0914 10:58:13.221709 94112 impalad-main.cc:89] ThriftServer 'backend' (on 
> port: 22000) did not start correctly
> . Impalad exiting.





[jira] [Created] (IMPALA-6305) Allow column definitions in ALTER VIEW

2017-12-11 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6305:
--

 Summary: Allow column definitions in ALTER VIEW
 Key: IMPALA-6305
 URL: https://issues.apache.org/jira/browse/IMPALA-6305
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Reporter: Alexander Behm


When working with views, we currently only allow separate column definitions in 
CREATE VIEW but not in ALTER VIEW.

Example:
{code}
create table t1 (c1 int, c2 int);
create view v (x comment 'hello world', y) as select * from t1;
describe v;
+------+------+-------------+
| name | type | comment     |
+------+------+-------------+
| x    | int  | hello world |
| y    | int  |             |
+------+------+-------------+
{code}

Currently we cannot use ALTER VIEW to change the column definitions after the 
fact, i.e. the following should be supported:

{code}
alter view v (z1, z2 comment 'foo bar') as select * from t1;
{code}






[jira] [Resolved] (IMPALA-6286) Wrong results with outer join and RUNTIME_FILTER_MODE=GLOBAL

2017-12-08 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6286.

   Resolution: Fixed
Fix Version/s: Impala 2.11.0

commit 09f6b7aa733066e7bc7247b74d24a1d8b5549cf3
Author: Alex Behm 
Date:   Wed Dec 6 15:11:38 2017 -0800

IMPALA-6286: Remove invalid runtime filter targets.

If the target expression of a runtime filter evaluates to a
non-NULL value for outer-join non-matches, then assigning
the filter below the nullable side of an outer join may
lead to incorrect query results.
See IMPALA-6286 for an example and explanation.

This patch adds a conservative check that prevents the
creation of runtime filters that could potentially
have such incorrect targets. Some safe opportunities
are deliberately missed to keep the code simple.
See RuntimeFilterGenerator#getTargetSlots().

Testing:
- added planner tests which passed locally

Change-Id: I88153eea9f4b5117df60366fad2bd91776b95298
Reviewed-on: http://gerrit.cloudera.org:8080/8783
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins


> Wrong results with outer join and RUNTIME_FILTER_MODE=GLOBAL
> 
>
> Key: IMPALA-6286
> URL: https://issues.apache.org/jira/browse/IMPALA-6286
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, 
> Impala 2.9.0, Impala 2.10.0
>Reporter: Alexander Behm
>Assignee: Alexander Behm
>Priority: Blocker
>  Labels: correctness, planner, runtime-filters
> Fix For: Impala 2.11.0
>
>
> Queries with the following characteristics may produce wrong results due to 
> an incorrectly assigned runtime filter:
> * The query option RUNTIME_FILTER_MODE is set to GLOBAL
> * The query has an outer join
> * A scan on the nullable side of that outer join has a runtime filter with a 
> NULL-checking expression such as COALESCE/IFNULL/CASE
> * The latter point implies that there is another join above the outer join 
> with a NULL-checking expression in its join condition
> Reproduction:
> {code}
> select count(*) from functional.alltypestiny t1
> left outer join functional.alltypestiny t2
>   on t1.id = t2.id
> where coalesce(t2.id + 10, 100) in (select 100)
> +----------+
> | count(*) |
> +----------+
> | 8        |
> +----------+
> {code}
> We expect a count of 0. A count of 8 is incorrect. 
> Query plan:
> {code}
> +---+
> | Explain String|
> +---+
> | Max Per-Host Resource Reservation: Memory=3.88MB  |
> | Per-Host Resource Estimates: Memory=87.88MB   |
> | Codegen disabled by planner   |
> |   |
> | PLAN-ROOT SINK|
> | | |
> | 10:AGGREGATE [FINALIZE]   |
> | |  output: count:merge(*) |
> | | |
> | 09:EXCHANGE [UNPARTITIONED]   |
> | | |
> | 05:AGGREGATE  |
> | |  output: count(*)   |
> | | |
> | 04:HASH JOIN [LEFT SEMI JOIN, BROADCAST]  |
> | |  hash predicates: coalesce(t2.id + 10, 100) = `$a$1`.`$c$1` |
> | |  runtime filters: RF000 <- `$a$1`.`$c$1`|
> | | |
> | |--08:EXCHANGE [BROADCAST]|
> | |  |  |
> | |  02:UNION   |
> | | constant-operands=1 |
> | | |
> | 03:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]   |
> | |  hash predicates: t1.id = t2.id |
> | | |
> | |--07:EXCHANGE [HASH(t2.id)]  |
> | |  |  |
> | |  01:SCAN HDFS [functional.alltypestiny t2]  |
> | | partitions=4/4 files=4 size=460B|

[jira] [Created] (IMPALA-6286) Wrong results with outer join and RUNTIME_FILTER_MODE=GLOBAL

2017-12-06 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6286:
--

 Summary: Wrong results with outer join and 
RUNTIME_FILTER_MODE=GLOBAL
 Key: IMPALA-6286
 URL: https://issues.apache.org/jira/browse/IMPALA-6286
 Project: IMPALA
  Issue Type: Bug
  Components: Frontend
Affects Versions: Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, Impala 2.7.0, 
Impala 2.6.0, Impala 2.5.0
Reporter: Alexander Behm
Priority: Blocker


Queries with the following characteristics may produce wrong results due to an 
incorrectly assigned runtime filter:
* The query option RUNTIME_FILTER_MODE is set to GLOBAL
* The query has an outer join
* A scan on the nullable side of that outer join has a runtime filter with a 
NULL-checking expression such as COALESCE/IFNULL/CASE
* The latter point implies that there is another join above the outer join with 
a NULL-checking expression in its join condition

Reproduction:
{code}
select count(*) from functional.alltypestiny t1
left outer join functional.alltypestiny t2
  on t1.id = t2.id
where coalesce(t2.id + 10, 100) in (select 100)
+----------+
| count(*) |
+----------+
| 8        |
+----------+
{code}
We expect a count of 0. A count of 8 is incorrect. 

Query plan:
{code}
+---+
| Explain String|
+---+
| Max Per-Host Resource Reservation: Memory=3.88MB  |
| Per-Host Resource Estimates: Memory=87.88MB   |
| Codegen disabled by planner   |
|   |
| PLAN-ROOT SINK|
| | |
| 10:AGGREGATE [FINALIZE]   |
| |  output: count:merge(*) |
| | |
| 09:EXCHANGE [UNPARTITIONED]   |
| | |
| 05:AGGREGATE  |
| |  output: count(*)   |
| | |
| 04:HASH JOIN [LEFT SEMI JOIN, BROADCAST]  |
| |  hash predicates: coalesce(t2.id + 10, 100) = `$a$1`.`$c$1` |
| |  runtime filters: RF000 <- `$a$1`.`$c$1`|
| | |
| |--08:EXCHANGE [BROADCAST]|
| |  |  |
| |  02:UNION   |
| | constant-operands=1 |
| | |
| 03:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]   |
| |  hash predicates: t1.id = t2.id |
| | |
| |--07:EXCHANGE [HASH(t2.id)]  |
| |  |  |
| |  01:SCAN HDFS [functional.alltypestiny t2]  |
| | partitions=4/4 files=4 size=460B|
| | runtime filters: RF000 -> coalesce(t2.id + 10, 100)  <--- This runtime 
filter is not correct   |
| | |
| 06:EXCHANGE [HASH(t1.id)] |
| | |
| 00:SCAN HDFS [functional.alltypestiny t1] |
|partitions=4/4 files=4 size=460B   |
+---+
{code}

Explanation:
* RF000 filters out all rows in the scan
> * In join 03 there are no join matches since the right-hand side is empty, so 
> the right-hand side columns are NULLed in every output row.
> * The join condition in join 04 is now satisfied by all input rows because 
> every "t2.id" is NULL, so after the COALESCE() the join condition becomes 100 = 100

*Workaround*
* Set RUNTIME_FILTER_MODE to LOCAL or OFF
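Concretely, the workaround can be applied per session (sketch):

{code}
set runtime_filter_mode=LOCAL;  -- or OFF
{code}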





[jira] [Resolved] (IMPALA-6237) Mismatch in plannerTest.testJoins output

2017-11-22 Thread Alexander Behm (JIRA)

 [ 
https://issues.apache.org/jira/browse/IMPALA-6237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm resolved IMPALA-6237.

Resolution: Duplicate

Very confident this is a duplicate of IMPALA-3887

> Mismatch in plannerTest.testJoins output
> 
>
> Key: IMPALA-6237
> URL: https://issues.apache.org/jira/browse/IMPALA-6237
> Project: IMPALA
>  Issue Type: Bug
>  Components: Frontend
>Affects Versions: Impala 2.11.0
>Reporter: Michael Ho
>Assignee: Tianyi Wang
>Priority: Blocker
>  Labels: broken-build
>
> PlannerTest recently started failing quite consistently. The following are 
> some of the mismatches between the expected and actual outputs:
> [~tianyiwang], [~alex.behm], any chance this may be related to recent changes 
> in the planner ?
> {noformat}
> Error Message
> Section DISTRIBUTEDPLAN of query:
> select *
> from functional.alltypesagg a
> full outer join functional.alltypessmall b using (id, int_col)
> right join functional.alltypesaggnonulls c on (a.id = c.id and b.string_col = 
> c.string_col)
> where a.day >= 6
> and b.month > 2
> and c.day < 3
> and a.tinyint_col = 15
> and b.string_col = '15'
> and a.tinyint_col + b.tinyint_col < 15
> and a.float_col - c.double_col < 0
> and (b.double_col * c.tinyint_col > 1000 or c.tinyint_col < 1000)
> Actual does not match expected result:
> PLAN-ROOT SINK
> |
> 09:EXCHANGE [UNPARTITIONED]
> |
> 04:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]
> ^^^
> |  hash predicates: c.id = a.id, c.string_col = b.string_col
> |  other predicates: a.tinyint_col = 15, b.string_col = '15', a.day >= 6, 
> b.month > 2, a.float_col - c.double_col < 0, a.tinyint_col + b.tinyint_col < 
> 15, (b.double_col * c.tinyint_col > 1000 OR c.tinyint_col < 1000)
> |
> |--08:EXCHANGE [HASH(a.id,b.string_col)]
> |  |
> |  03:HASH JOIN [FULL OUTER JOIN, PARTITIONED]
> |  |  hash predicates: a.id = b.id, a.int_col = b.int_col
> |  |
> |  |--06:EXCHANGE [HASH(b.id,b.int_col)]
> |  |  |
> |  |  01:SCAN HDFS [functional.alltypessmall b]
> |  | partitions=2/4 files=2 size=3.17KB
> |  | predicates: b.string_col = '15'
> |  |
> |  05:EXCHANGE [HASH(a.id,a.int_col)]
> |  |
> |  00:SCAN HDFS [functional.alltypesagg a]
> | partitions=5/11 files=5 size=372.38KB
> | predicates: a.tinyint_col = 15
> |
> 07:EXCHANGE [HASH(c.id,c.string_col)]
> |
> 02:SCAN HDFS [functional.alltypesaggnonulls c]
>partitions=2/10 files=2 size=148.10KB
> Expected:
> PLAN-ROOT SINK
> |
> 09:EXCHANGE [UNPARTITIONED]
> |
> 04:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]
> |  hash predicates: a.id = c.id, b.string_col = c.string_col
> |  other predicates: a.tinyint_col = 15, b.string_col = '15', a.day >= 6, 
> b.month > 2, a.float_col - c.double_col < 0, a.tinyint_col + b.tinyint_col < 
> 15, (b.double_col * c.tinyint_col > 1000 OR c.tinyint_col < 1000)
> |  runtime filters: RF000 <- c.id, RF001 <- c.string_col
> |
> |--08:EXCHANGE [HASH(c.id,c.string_col)]
> |  |
> |  02:SCAN HDFS [functional.alltypesaggnonulls c]
> | partitions=2/10 files=2 size=148.10KB
> |
> 07:EXCHANGE [HASH(a.id,b.string_col)]
> |
> 03:HASH JOIN [FULL OUTER JOIN, PARTITIONED]
> |  hash predicates: a.id = b.id, a.int_col = b.int_col
> |
> |--06:EXCHANGE [HASH(b.id,b.int_col)]
> |  |
> |  01:SCAN HDFS [functional.alltypessmall b]
> | partitions=2/4 files=2 size=3.17KB
> | predicates: b.string_col = '15'
> | runtime filters: RF001 -> b.string_col
> |
> 05:EXCHANGE [HASH(a.id,a.int_col)]
> |
> 00:SCAN HDFS [functional.alltypesagg a]
>partitions=5/11 files=5 size=372.38KB
>predicates: a.tinyint_col = 15
>runtime filters: RF000 -> a.id
> Verbose plan:
> F05:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> Per-Host Resources: mem-estimate=0B mem-reservation=0B
>   PLAN-ROOT SINK
>   |  mem-estimate=0B mem-reservation=0B
>   |
>   09:EXCHANGE [UNPARTITIONED]
>  mem-estimate=0B mem-reservation=0B
>  tuple-ids=2,0N,1N row-size=303B cardinality=2000
> F04:PLAN FRAGMENT [HASH(c.id,c.string_col)] hosts=2 instances=2
> Per-Host Resources: mem-estimate=1.94MB mem-reservation=1.94MB
>   DATASTREAM SINK [FRAGMENT=F05, EXCHANGE=09, UNPARTITIONED]
>   |  mem-estimate=0B mem-reservation=0B
>   04:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]
>   |  hash predicates: c.id = a.id, c.string_col = b.string_col
>   |  fk/pk conjuncts: c.id = a.id
>   |  other predicates: a.tinyint_col = 15, b.string_col = '15', a.day >= 6, 
> b.month > 2, a.float_col - c.double_col < 0, a.tinyint_col + b.tinyint_col < 
> 15, (b.double_col * c.tinyint_col > 1000 OR c.tinyint_col < 1000)
>   |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
>   |  tuple-ids=2,0N,1N row-size=303B cardinality=2000
>   |
>   |--08:EXCHANGE [HASH(a.id,b.string_col)]
>   | 

[jira] [Created] (IMPALA-6233) Document the column definitions list in CREATE VIEW

2017-11-21 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6233:
--

 Summary: Document the column definitions list in CREATE VIEW
 Key: IMPALA-6233
 URL: https://issues.apache.org/jira/browse/IMPALA-6233
 Project: IMPALA
  Issue Type: Improvement
  Components: Docs
Affects Versions: Impala 2.10.0
Reporter: Alexander Behm
Assignee: John Russell


Looking at this page:
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_create_view.html#create_view

It appears we do not have an example for the "columns_list" that shows adding a 
comment to a column. We should add that.

Example:
{code}
create table t1 (c1 int, c2 int);
create view v (x comment 'hello world', y) as select * from t1;
describe v;
+------+------+-------------+
| name | type | comment     |
+------+------+-------------+
| x    | int  | hello world |
| y    | int  |             |
+------+------+-------------+
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (IMPALA-6228) More flexible configuration of stats extrapolation

2017-11-20 Thread Alexander Behm (JIRA)
Alexander Behm created IMPALA-6228:
--

 Summary: More flexible configuration of stats extrapolation
 Key: IMPALA-6228
 URL: https://issues.apache.org/jira/browse/IMPALA-6228
 Project: IMPALA
  Issue Type: Improvement
  Components: Frontend
Affects Versions: Impala 2.10.0, Impala 2.11.0
Reporter: Alexander Behm


For stats extrapolation (IMPALA-2373) and COMPUTE STATS TABLESAMPLE 
(IMPALA-5310) we currently require the impalad startup option 
-enable_stats_extrapolation to be set.

It would be nice if changing that configuration did not require a service 
restart.
For example, we could consider a query option instead of adding a table 
property to tables where extrapolation should be enabled.

The reason for the current behavior is as follows:
It is technically not required to be a startup option, but it reduces the 
number of ways users can shoot themselves in the foot. For example, first 
running COMPUTE STATS TABLESAMPLE on a table T and then running a query against 
table T without stats extrapolation does not make sense and will not work well. 
This subtle behavior might not be clear to users. Yes, that can be addressed 
with warnings etc., but preventing nonsensical combinations seems better until 
we have strong evidence against that conservative approach.
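For reference, the current flag-based configuration is set at daemon startup, e.g. (sketch; the exact deployment mechanism varies):

{code}
impalad -enable_stats_extrapolation=true ...
{code}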




