[jira] [Created] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

2020-07-29 Thread Eugene Chung (Jira)
Eugene Chung created HIVE-23954:
---

 Summary: count(*) with count(distinct) gives wrong results with 
hive.optimize.countdistinct=true
 Key: HIVE-23954
 URL: https://issues.apache.org/jira/browse/HIVE-23954
 Project: Hive
  Issue Type: Bug
  Components: Logical Optimizer
Affects Versions: 3.1.0, 3.0.0
Reporter: Eugene Chung


The query
{code:sql}
select count(*), count(distinct mycol) from db1.table1 where partitioned_column = '...'
{code}
does not work correctly when hive.optimize.countdistinct is true, which is the default for all 3.x versions.

In the two plans below, the aggregations in the Output of the Group By Operator of Map 1 are different.

 

- hive.optimize.countdistinct=false
{code:java}
++
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:-1   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_7]  |
| Group By Operator [GBY_5] (rows=1 width=24) |
|   Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] |
| <-Map 1 [SIMPLE_EDGE]  |
|   SHUFFLE [RS_4]   |
| Group By Operator [GBY_3] (rows=343640771 width=4160) |
|   Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid |
|   Select Operator [SEL_2] (rows=343640771 width=4160) |
| Output:["mid"] |
| TableScan [TS_0] (rows=343640771 width=4160) |
|   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
||
++
{code}
 

- hive.optimize.countdistinct=true
{code:java}
++
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:-1   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_7]  |
| Group By Operator [GBY_14] (rows=1 width=16) |
|   Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
|   Group By Operator [GBY_11] (rows=343640771 width=4160) |
|   Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_10]|
|   PartitionCols:_col0  |
|   Group By Operator [GBY_9] (rows=343640771 width=4160) |
| Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
| Select Operator [SEL_2] (rows=343640771 width=4160) |
|   Output:["mid"]   |
|   TableScan [TS_0] (rows=343640771 width=4160) |
| db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
||
++
{code}
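For intuition, the countdistinct rewrite is conceptually a two-stage aggregation over the grouped column. The Java sketch below (hypothetical names, not Hive code) shows the intended semantics: reproducing count(*) requires summing the per-key counts, while merely counting them yields the number of distinct keys. One plausible reading of the second plan is that GBY_14 applies count() to the per-key count column, which would make count(*) collapse to the distinct count.
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountDistinctRewrite {
    /** Stage 1: group by the column, counting rows per key (Map-side GBY). */
    static Map<String, Long> perKeyCounts(List<String> col) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : col) {
            counts.merge(v, 1L, Long::sum);
        }
        return counts;
    }

    /** Stage 2: count(distinct col) = number of distinct keys. */
    static long countDistinct(Map<String, Long> perKey) {
        return perKey.size();
    }

    /** Stage 2: count(*) = SUM of the per-key counts, not a count of them. */
    static long countStar(Map<String, Long> perKey) {
        return perKey.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Long> perKey = perKeyCounts(List.of("a", "a", "b", "c", "c", "c"));
        System.out.println(countStar(perKey));     // 6
        System.out.println(countDistinct(perKey)); // 3
    }
}
{code}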



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23953) Use task counter information to compute keycount during hashtable loading

2020-07-29 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23953:
---

 Summary: Use task counter information to compute keycount during 
hashtable loading
 Key: HIVE-23953
 URL: https://issues.apache.org/jira/browse/HIVE-23953
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan


There are cases where the compiler misestimates the key count, and this results in a 
number of hashtable resizes at runtime.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]

In such cases, it would be good to get the "approximate_input_records" (TEZ-4207) 
counter from upstream to compute the key count more accurately at runtime.
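For intuition on why the key count matters: a hash table pre-sized from an accurate estimate never needs to rehash while loading. A minimal sketch (illustrative only, not the actual VectorMapJoinFastHashTableLoader code) of deriving an initial capacity from an estimated key count and load factor:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class HashTableSizing {
    /**
     * Smallest power of two that holds expectedKeys entries at the given
     * load factor without triggering a resize.
     */
    static int initialCapacity(long expectedKeys, float loadFactor) {
        long needed = (long) Math.ceil(expectedKeys / (double) loadFactor);
        int capacity = 1;
        while (capacity < needed && capacity < (1 << 30)) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        // With an accurate runtime counter, the table is sized once
        // instead of resizing repeatedly as keys stream in.
        long estimatedKeys = 1_000_000L;
        Map<Long, Long> table =
                new HashMap<>(initialCapacity(estimatedKeys, 0.75f), 0.75f);
        System.out.println(initialCapacity(estimatedKeys, 0.75f)); // 2097152
    }
}
{code}
A misestimate that is too low costs repeated rehashes; one that is too high wastes memory, which is why a runtime counter beats a compile-time guess.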

 


--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23952) Reuse VectorAggregationBuffer to reduce GC pressure in VectorGroupByOperator

2020-07-29 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23952:
---

 Summary: Reuse VectorAggregationBuffer to reduce GC pressure in 
VectorGroupByOperator
 Key: HIVE-23952
 URL: https://issues.apache.org/jira/browse/HIVE-23952
 Project: Hive
  Issue Type: Bug
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-07-30 at 7.38.13 AM.png

!Screenshot 2020-07-30 at 7.38.13 AM.png|width=1171,height=892!

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java]
{code:java}
aggregationBuffer = allocateAggregationBuffer(); {code}
Flushed-out aggregation buffers could be reused here instead of being allocated every 
time, to reduce GC pressure.
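A minimal sketch of the reuse idea (hypothetical names, not the actual VectorGroupByOperator code): keep flushed buffers on a free list and reset them on reuse instead of allocating a fresh buffer each time:
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

public class AggregationBufferPool {
    /** Stand-in for a vector aggregation buffer. */
    static final class AggregationBuffer {
        long sum;
        void reset() { sum = 0; }
    }

    private final Deque<AggregationBuffer> free = new ArrayDeque<>();
    int allocations; // visible for the demo only

    /** Reuse a flushed buffer when available; allocate otherwise. */
    AggregationBuffer acquire() {
        AggregationBuffer b = free.poll();
        if (b == null) {
            allocations++;
            return new AggregationBuffer();
        }
        b.reset();
        return b;
    }

    /** Called once a buffer's group has been flushed downstream. */
    void release(AggregationBuffer b) {
        free.push(b);
    }

    public static void main(String[] args) {
        AggregationBufferPool pool = new AggregationBufferPool();
        AggregationBuffer b = pool.acquire();
        pool.release(b);
        pool.acquire(); // reused, no second allocation
        System.out.println(pool.allocations); // 1
    }
}
{code}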



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause

2020-07-29 Thread Vineet Garg (Jira)
Vineet Garg created HIVE-23951:
--

 Summary: Support parameterized queries in WHERE/HAVING clause
 Key: HIVE-23951
 URL: https://issues.apache.org/jira/browse/HIVE-23951
 Project: Hive
  Issue Type: Sub-task
  Components: Query Planning
Reporter: Vineet Garg
Assignee: Vineet Garg






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23950) Support PREPARE and EXECUTE statements

2020-07-29 Thread Vineet Garg (Jira)
Vineet Garg created HIVE-23950:
--

 Summary: Support PREPARE and EXECUTE statements
 Key: HIVE-23950
 URL: https://issues.apache.org/jira/browse/HIVE-23950
 Project: Hive
  Issue Type: New Feature
  Components: Query Planning
Reporter: Vineet Garg
Assignee: Vineet Garg


PREPARE and EXECUTE statements provide the ability to create a parameterized query 
once and re-execute it with different parameters.
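As a sketch of the mechanics (hypothetical names, not the proposed Hive implementation), PREPARE amounts to compiling and caching a plan once, and EXECUTE to binding parameters into that cached plan:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class PreparedQueries {
    /** Stand-in for a compiled plan: the query text with a '?' placeholder. */
    static final class CompiledPlan {
        final String template;
        CompiledPlan(String template) { this.template = template; }
        String execute(String param) {
            // Real engines bind parameters into the plan; here we substitute.
            return template.replace("?", "'" + param + "'");
        }
    }

    private final Map<String, CompiledPlan> cache = new HashMap<>();
    int compilations; // visible for the demo only

    /** PREPARE: compile (expensive) only on first use. */
    CompiledPlan prepare(String sql) {
        return cache.computeIfAbsent(sql, s -> {
            compilations++;
            return new CompiledPlan(s);
        });
    }

    public static void main(String[] args) {
        PreparedQueries pq = new PreparedQueries();
        String sql = "select * from t where c = ?";
        // EXECUTE twice with different parameters; compiled only once.
        System.out.println(pq.prepare(sql).execute("a"));
        System.out.println(pq.prepare(sql).execute("b"));
        System.out.println(pq.compilations); // 1
    }
}
{code}
The win is that the second and later executions skip parsing, semantic analysis, and optimization entirely.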



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23949) Introduce caching layer in HS2 to accelerate query compilation

2020-07-29 Thread Soumyakanti Das (Jira)
Soumyakanti Das created HIVE-23949:
--

 Summary: Introduce caching layer in HS2 to accelerate query 
compilation
 Key: HIVE-23949
 URL: https://issues.apache.org/jira/browse/HIVE-23949
 Project: Hive
  Issue Type: New Feature
  Components: HiveServer2
Reporter: Soumyakanti Das
Assignee: Soumyakanti Das


One of the major contributors to compilation latency is the retrieval of 
metadata from HMS. This JIRA introduces a caching layer in HS2 for this 
metadata. Its design is simple: it relies on the snapshot information being 
queried to cache and invalidate the metadata. This will help reduce the time 
spent in compilation by using HS2 memory more effectively, and it will allow us 
to improve HMS throughput for multi-tenant workloads by reducing the number of 
calls it needs to serve.

This patch only caches partition retrieval information, which is often one of 
the most costly metadata operations. It also sets the foundation for caching 
additional calls in follow-up work.
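The snapshot-based invalidation described above can be pictured as keying cache entries by the snapshot under which the metadata was read, so a newer snapshot naturally misses the stale entry. A minimal sketch with hypothetical names (this is not the actual HS2 code):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotMetadataCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    int hmsCalls; // visible for the demo only

    /** Fetch partition metadata, hitting HMS only on a cache miss. */
    String getPartitionInfo(String table, long snapshotId) {
        // The snapshot id is part of the key: entries read under an older
        // snapshot are simply never matched again (implicit invalidation).
        return cache.computeIfAbsent(table + "@" + snapshotId, k -> {
            hmsCalls++;
            return fetchFromHms(table);
        });
    }

    /** Placeholder for the real (expensive) HMS round trip. */
    private String fetchFromHms(String table) {
        return "partitions-of-" + table;
    }

    public static void main(String[] args) {
        SnapshotMetadataCache c = new SnapshotMetadataCache();
        c.getPartitionInfo("db1.table1", 42); // miss: one HMS call
        c.getPartitionInfo("db1.table1", 42); // hit: served from HS2 memory
        c.getPartitionInfo("db1.table1", 43); // new snapshot: fetched again
        System.out.println(c.hmsCalls); // 2
    }
}
{code}
A real implementation would also evict entries belonging to superseded snapshots; this sketch keeps them only to illustrate the keying.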



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23948) Improve Query Results Cache

2020-07-29 Thread Hunter Logan (Jira)
Hunter Logan created HIVE-23948:
---

 Summary: Improve Query Results Cache
 Key: HIVE-23948
 URL: https://issues.apache.org/jira/browse/HIVE-23948
 Project: Hive
  Issue Type: Improvement
Reporter: Hunter Logan


Creating a Jira for this GitHub PR, which predates the time when GitHub was actively used for Hive development:

[https://github.com/apache/hive/pull/652]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23947) Cache affinity is unset for text files read by LLAP

2020-07-29 Thread Jira
Ádám Szita created HIVE-23947:
-

 Summary: Cache affinity is unset for text files read by LLAP
 Key: HIVE-23947
 URL: https://issues.apache.org/jira/browse/HIVE-23947
 Project: Hive
  Issue Type: Bug
Reporter: Ádám Szita
Assignee: Ádám Szita


LLAP relies on HostAffinitySplitLocationProvider to always route the same splits to 
the same LLAP daemons. Such a consistent assignment of data among the nodes yields 
a good cache hit ratio and thus good performance.

For text files this is almost never the case: HostAffinitySplitLocationProvider 
is never used, because HS2 does not set the cache affinity flag in the job conf 
for text input format content during compilation. The launched Tez AM then has to 
rely on HDFS location information to route the splits (and therefore tasks) to 
the executor nodes. This location information might not overlap well with where 
the actual daemons run, and in the S3 case the Tez AM mostly chooses executors 
randomly.

As a result, the hit ratio hardly ever reaches 100%: each time we re-run the same 
query, some disk/S3 reads still occur, until the same content eventually gets 
populated into all the daemons (after running the query tens or hundreds of 
times). This causes poor performance.
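The consistent routing that HostAffinitySplitLocationProvider gives can be approximated by hashing a split's path deterministically onto the daemon list, so re-running a query lands each split on the node that already cached its data. An illustrative sketch (not the actual implementation):
{code:java}
import java.util.List;

public class SplitAffinity {
    /**
     * Deterministically map a split to one daemon so that repeated runs of
     * the same query route the split (and its cached data) to the same node.
     */
    static String locationFor(String splitPathAndOffset, List<String> daemons) {
        int h = splitPathAndOffset.hashCode();
        int index = Math.floorMod(h, daemons.size());
        return daemons.get(index);
    }

    public static void main(String[] args) {
        List<String> daemons = List.of("llap-node-0", "llap-node-1", "llap-node-2");
        String split = "s3a://bucket/warehouse/t/part-00000:0+134217728";
        // Same split always routes to the same daemon, so its cached
        // contents are found there on every re-run of the query.
        System.out.println(locationFor(split, daemons)
                .equals(locationFor(split, daemons))); // true
    }
}
{code}
Without such affinity (the S3/text-file case described above), routing is effectively random, so each run warms a different daemon's cache.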



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23946) Improve control flow and error handling in QTest dataset loading/unloading

2020-07-29 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23946:
--

 Summary: Improve control flow and error handling in QTest dataset 
loading/unloading
 Key: HIVE-23946
 URL: https://issues.apache.org/jira/browse/HIVE-23946
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


This issue focuses mainly on the following methods:
[QTestDatasetHandler#initDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L76]
[QTestDatasetHandler#unloadDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L95]

related to QTest dataset loading and unloading.

The boolean return type in these methods is redundant since they either fail or 
return true (they never return false).

The methods should throw an Exception instead of an AssertionError to indicate 
failure. This allows code higher up the stack to perform proper recovery and to 
report the failure properly. At the moment, if an AssertionError is raised from 
these methods, dependent code (e.g., 
[CoreCliDriver|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreCliDriver.java#L188])
 fails to notice that the query has failed.

In case of a failure during loading/unloading, the environment (instance and class 
variables) is not properly cleaned up, leading to failures in all subsequent tests.
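The proposed change can be sketched as follows (simplified, hypothetical signatures, not the actual QTestDatasetHandler code): replace the always-true boolean with a void method that throws a checked Exception:
{code:java}
public class DatasetLoading {
    /** Before: the return value carries no information, and an
     *  AssertionError bypasses Exception-based recovery up the stack. */
    static boolean initDatasetOld(String table, boolean ok) {
        if (!ok) {
            throw new AssertionError("failed to load dataset " + table);
        }
        return true; // never returns false
    }

    /** After: void, failing with a checked Exception that callers must
     *  handle, so drivers can report the failure and clean up properly. */
    static void initDataset(String table, boolean ok) throws Exception {
        if (!ok) {
            throw new Exception("failed to load dataset " + table);
        }
    }

    public static void main(String[] args) {
        try {
            initDataset("src", false);
        } catch (Exception e) {
            // The driver can now clean instance/class state before moving on.
            System.out.println("recovered: " + e.getMessage());
        }
    }
}
{code}
The checked Exception also forces every caller to decide explicitly what cleanup the failure requires, instead of the error silently unwinding past the test driver.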





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23945) Fix TestCodahaleReportersConf

2020-07-29 Thread Zoltan Haindrich (Jira)
Zoltan Haindrich created HIVE-23945:
---

 Summary: Fix TestCodahaleReportersConf
 Key: HIVE-23945
 URL: https://issues.apache.org/jira/browse/HIVE-23945
 Project: Hive
  Issue Type: Bug
Reporter: Zoltan Haindrich


http://ci.hive.apache.org/job/hive-precommit/job/master/138/testReport/junit/org.apache.hadoop.hive.common.metrics.metrics2/TestCodahaleReportersConf/Testing___split_11___Archive___testNoFallback/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23944) Stabilize TestStatsReplicationScenariosMMNoAutogather

2020-07-29 Thread Zoltan Haindrich (Jira)
Zoltan Haindrich created HIVE-23944:
---

 Summary: Stabilize TestStatsReplicationScenariosMMNoAutogather
 Key: HIVE-23944
 URL: https://issues.apache.org/jira/browse/HIVE-23944
 Project: Hive
  Issue Type: Bug
Reporter: Zoltan Haindrich


http://ci.hive.apache.org/job/hive-precommit/job/master/130/testReport/junit/org.apache.hadoop.hive.ql.parse/TestStatsReplicationScenariosMMNoAutogather/Testing___split_16___Archive___testMetadataOnlyDump/
http://ci.hive.apache.org/job/hive-precommit/job/PR-1303/lastCompletedBuild/testReport/org.apache.hadoop.hive.ql.parse/TestStatsReplicationScenariosMigrationNoAutogather/Testing___split_09___Archive___testRetryFailure/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)