[jira] [Created] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
Eugene Chung created HIVE-23954:
-----------------------------------

             Summary: count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
                 Key: HIVE-23954
                 URL: https://issues.apache.org/jira/browse/HIVE-23954
             Project: Hive
          Issue Type: Bug
          Components: Logical Optimizer
    Affects Versions: 3.1.0, 3.0.0
            Reporter: Eugene Chung


select count(*), count(distinct mycol) from db1.table1 where partitioned_column = '...'

returns wrong results when hive.optimize.countdistinct is true. The setting is true by default in all 3.x versions.

In the two plans below, the aggregations in the Output of the Group By Operator of Map 1 differ.

- hive.optimize.countdistinct=false

{code:java}
Explain

Plan optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2
      File Output Operator [FS_7]
        Group By Operator [GBY_5] (rows=1 width=24)
          Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"]
        <-Map 1 [SIMPLE_EDGE]
          SHUFFLE [RS_4]
            Group By Operator [GBY_3] (rows=343640771 width=4160)
              Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid
              Select Operator [SEL_2] (rows=343640771 width=4160)
                Output:["mid"]
                TableScan [TS_0] (rows=343640771 width=4160)
                  db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"]
{code}

- hive.optimize.countdistinct=true

{code:java}
Explain

Plan optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2
      File Output Operator [FS_7]
        Group By Operator [GBY_14] (rows=1 width=16)
          Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"]
          Group By Operator [GBY_11] (rows=343640771 width=4160)
            Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
          <-Map 1 [SIMPLE_EDGE]
            SHUFFLE [RS_10]
              PartitionCols:_col0
              Group By Operator [GBY_9] (rows=343640771 width=4160)
                Output:["_col0","_col1"],aggregations:["count()"],keys:mid
                Select Operator [SEL_2] (rows=343640771 width=4160)
                  Output:["mid"]
                  TableScan [TS_0] (rows=343640771 width=4160)
                    db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"]
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
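To make the difference between the two HIVE-23954 plans concrete, the sketch below simulates them in plain Java on a small in-memory list. It is not Hive code: the class and method names are hypothetical, and the reading that the outer Group By Operator of the second plan counts the per-key counts (count(_col1)) where a sum would be needed is an interpretation of the EXPLAIN output, not something the report confirms.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only; none of these names exist in Hive.
class CountDistinctPlans {

    // Expected semantics: count(*) counts every row,
    // count(distinct mid) counts the distinct non-null values.
    static long[] expected(List<String> mids) {
        long total = mids.size();
        long distinct = mids.stream().filter(Objects::nonNull).distinct().count();
        return new long[] {total, distinct};
    }

    // Inner group-by of the second plan: keys:mid, aggregations:["count()"].
    static Map<String, Long> countPerKey(List<String> mids) {
        Map<String, Long> perKey = new HashMap<>();
        for (String mid : mids) {
            perKey.merge(mid, 1L, Long::sum); // HashMap accepts a null key
        }
        return perKey;
    }

    // Outer group-by as printed in the plan: count(_col1), count(_col0).
    static long[] outerAsPrinted(Map<String, Long> perKey) {
        long countOfCounts = perKey.size(); // every per-key count is non-null
        long distinct = perKey.keySet().stream().filter(Objects::nonNull).count();
        return new long[] {countOfCounts, distinct};
    }

    // Outer group-by that sums the per-key counts instead (assumed intent).
    static long[] outerWithSum(Map<String, Long> perKey) {
        long total = perKey.values().stream().mapToLong(Long::longValue).sum();
        long distinct = perKey.keySet().stream().filter(Objects::nonNull).count();
        return new long[] {total, distinct};
    }

    public static void main(String[] args) {
        List<String> mids = Arrays.asList("a", "a", "b", null, "b", "a");
        Map<String, Long> perKey = countPerKey(mids);
        System.out.println(Arrays.toString(expected(mids)));
        System.out.println(Arrays.toString(outerAsPrinted(perKey)));
        System.out.println(Arrays.toString(outerWithSum(perKey)));
    }
}
```

Under this reading, the first value of the rewritten plan is the number of groups rather than the number of rows, which would explain a wrong count(*) while count(distinct) stays correct.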
[jira] [Created] (HIVE-23953) Use task counter information to compute keycount during hashtable loading
Rajesh Balamohan created HIVE-23953:
---------------------------------------

             Summary: Use task counter information to compute keycount during hashtable loading
                 Key: HIVE-23953
                 URL: https://issues.apache.org/jira/browse/HIVE-23953
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan


There are cases where the compiler misestimates the key count, which results in a number of hashtable resizes at runtime.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin/fast/VectorMapJoinFastHashTableLoader.java#L128]

In such cases, it would be good to get the "approximate_input_records" (TEZ-4207) counter from upstream to compute the key count more accurately at runtime.
[jira] [Created] (HIVE-23952) Reuse VectorAggregationBuffer to reduce GC pressure in VectorGroupByOperator
Rajesh Balamohan created HIVE-23952:
---------------------------------------

             Summary: Reuse VectorAggregationBuffer to reduce GC pressure in VectorGroupByOperator
                 Key: HIVE-23952
                 URL: https://issues.apache.org/jira/browse/HIVE-23952
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan
         Attachments: Screenshot 2020-07-30 at 7.38.13 AM.png


!Screenshot 2020-07-30 at 7.38.13 AM.png|width=1171,height=892!

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java]

{code:java}
aggregationBuffer = allocateAggregationBuffer();
{code}

Aggregation buffers that have been flushed out could be reused here instead of allocating a new one every time, to reduce GC pressure.
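The reuse idea in HIVE-23952 can be sketched as a small free list: flushed buffers are returned to a pool and handed out again before anything new is allocated. This is only an illustration under assumptions; BufferPool, acquire and release are hypothetical names, not part of VectorGroupByOperator, and a real change would also have to reset each buffer's aggregation state before reuse.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical sketch; not the VectorGroupByOperator implementation.
class BufferPool<T> {
    private final Deque<T> free = new ArrayDeque<>();
    private final Supplier<T> allocator;
    private int allocations = 0;

    BufferPool(Supplier<T> allocator) {
        this.allocator = allocator;
    }

    // Hand out a previously released buffer if one is available, else allocate.
    T acquire() {
        T buffer = free.pollFirst();
        if (buffer == null) {
            buffer = allocator.get();
            allocations++;
        }
        return buffer;
    }

    // Called after a flush; the caller is assumed to have reset the buffer.
    void release(T buffer) {
        free.addFirst(buffer);
    }

    // Number of buffers actually allocated so far.
    int allocations() {
        return allocations;
    }
}
```

With such a pool, the number of live allocations is bounded by the peak number of buffers in use rather than by the number of flushes.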
[jira] [Created] (HIVE-23951) Support parameterized queries in WHERE/HAVING clause
Vineet Garg created HIVE-23951:
----------------------------------

             Summary: Support parameterized queries in WHERE/HAVING clause
                 Key: HIVE-23951
                 URL: https://issues.apache.org/jira/browse/HIVE-23951
             Project: Hive
          Issue Type: Sub-task
          Components: Query Planning
            Reporter: Vineet Garg
            Assignee: Vineet Garg
[jira] [Created] (HIVE-23950) Support PREPARE and EXECUTE statements
Vineet Garg created HIVE-23950:
----------------------------------

             Summary: Support PREPARE and EXECUTE statements
                 Key: HIVE-23950
                 URL: https://issues.apache.org/jira/browse/HIVE-23950
             Project: Hive
          Issue Type: New Feature
          Components: Query Planning
            Reporter: Vineet Garg
            Assignee: Vineet Garg


PREPARE and EXECUTE statements make it possible to create a parameterized query and re-use it, executing it with different parameters.
[jira] [Created] (HIVE-23949) Introduce caching layer in HS2 to accelerate query compilation
Soumyakanti Das created HIVE-23949:
--------------------------------------

             Summary: Introduce caching layer in HS2 to accelerate query compilation
                 Key: HIVE-23949
                 URL: https://issues.apache.org/jira/browse/HIVE-23949
             Project: Hive
          Issue Type: New Feature
          Components: HiveServer2
            Reporter: Soumyakanti Das
            Assignee: Soumyakanti Das


One of the major contributors to compilation latency is the retrieval of metadata from HMS. This JIRA introduces a caching layer in HS2 for this metadata. Its design is simple, relying on snapshot information being queried to cache and invalidate the metadata. This will help us to reduce the time spent in compilation by using HS2 memory more effectively, and it will allow us to improve HMS throughput for multi-tenant workloads by reducing the number of calls it needs to serve.

This patch only caches partition retrieval information, which is often one of the most costly metadata operations. It also sets the foundation to cache additional calls in follow-up work.
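One way to picture the snapshot-based caching described in HIVE-23949 is a map keyed by table name plus a snapshot identifier, so that a query against a newer snapshot never hits an entry cached for an older one. The sketch below is an assumption-laden illustration, not the proposed patch: PartitionCache, its String snapshot identifier and the metastoreLookup function standing in for an HMS call are all hypothetical, and eviction of stale entries is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; not the HS2 implementation proposed in HIVE-23949.
class PartitionCache {

    // Cache key: table name plus an opaque snapshot identifier (assumed to be a String).
    private static final class Key {
        final String table;
        final String snapshot;

        Key(String table, String snapshot) {
            this.table = table;
            this.snapshot = snapshot;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return table.equals(other.table) && snapshot.equals(other.snapshot);
        }

        @Override
        public int hashCode() {
            return Objects.hash(table, snapshot);
        }
    }

    private final Map<Key, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> metastoreLookup; // stands in for an HMS call
    private final AtomicInteger lookups = new AtomicInteger();

    PartitionCache(Function<String, List<String>> metastoreLookup) {
        this.metastoreLookup = metastoreLookup;
    }

    // Returns cached partitions for (table, snapshot); loads them on a miss.
    List<String> getPartitions(String table, String snapshot) {
        return cache.computeIfAbsent(new Key(table, snapshot), key -> {
            lookups.incrementAndGet();
            return metastoreLookup.apply(key.table);
        });
    }

    // Number of times the backing lookup was actually invoked.
    int lookups() {
        return lookups.get();
    }
}
```

Repeated compilations against the same snapshot are served from memory, while a changed snapshot identifier produces a different key and therefore a fresh lookup.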
[jira] [Created] (HIVE-23948) Improve Query Results Cache
Hunter Logan created HIVE-23948:
-----------------------------------

             Summary: Improve Query Results Cache
                 Key: HIVE-23948
                 URL: https://issues.apache.org/jira/browse/HIVE-23948
             Project: Hive
          Issue Type: Improvement
            Reporter: Hunter Logan


Creating a Jira for this GitHub PR, which dates from before GitHub was actively used:

[https://github.com/apache/hive/pull/652]
[jira] [Created] (HIVE-23947) Cache affinity is unset for text files read by LLAP
Ádám Szita created HIVE-23947:
---------------------------------

             Summary: Cache affinity is unset for text files read by LLAP
                 Key: HIVE-23947
                 URL: https://issues.apache.org/jira/browse/HIVE-23947
             Project: Hive
          Issue Type: Bug
            Reporter: Ádám Szita
            Assignee: Ádám Szita


LLAP relies on HostAffinitySplitLocationProvider to always route the same splits to the same LLAP daemons. Distributing the data consistently among the nodes in this way yields a good cache hit ratio and thus good performance.

For text files this almost never happens: HostAffinitySplitLocationProvider is never used, because HS2 does not set the cache affinity flag in the job conf for text inputformat content during compile. The launched Tez AM then has to rely on HDFS location information to route the splits (and therefore tasks) to the executor nodes. This location information might not overlap well with where the daemons actually run, and in the S3 case the Tez AM will choose executors mostly at random.

As a result, the hit ratio hardly ever reaches 100%: each time the same query is re-run, some disk/S3 reads still occur, causing poor performance. This lasts until the same content has been populated into all the daemons (after running the query tens or hundreds of times).
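The benefit of cache affinity in HIVE-23947 comes from the split-to-daemon mapping being deterministic. The sketch below shows that property with a plain hash-modulo mapping; this is a swapped-in simplification and not the hashing that HostAffinitySplitLocationProvider actually uses. SplitAffinity, locationFor and the host names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch; simple hash-modulo stands in for the real provider's logic.
class SplitAffinity {

    // Maps a split, identified by file path and start offset, to one daemon.
    // The same (path, offset, daemon list) always yields the same daemon,
    // so repeated runs of a query read the split from the same cache.
    static String locationFor(String path, long startOffset, List<String> daemons) {
        int hash = Objects.hash(path, startOffset);
        return daemons.get(Math.floorMod(hash, daemons.size()));
    }

    public static void main(String[] args) {
        List<String> daemons = Arrays.asList("llap-0", "llap-1", "llap-2");
        // Same split, same daemon on every run.
        System.out.println(locationFor("/warehouse/t/part-0.txt", 0L, daemons));
        System.out.println(locationFor("/warehouse/t/part-0.txt", 0L, daemons));
    }
}
```

Without such a mapping, as the report describes for text files, placement follows HDFS locality or is effectively random, so each run may warm a different daemon's cache.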
[jira] [Created] (HIVE-23946) Improve control flow and error handling in QTest dataset loading/unloading
Stamatis Zampetakis created HIVE-23946:
------------------------------------------

             Summary: Improve control flow and error handling in QTest dataset loading/unloading
                 Key: HIVE-23946
                 URL: https://issues.apache.org/jira/browse/HIVE-23946
             Project: Hive
          Issue Type: Improvement
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


This issue focuses mainly on the following methods, which are related to QTest dataset loading and unloading:

[QTestDatasetHandler#initDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L76]

[QTestDatasetHandler#unloadDataset|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/ql/dataset/QTestDatasetHandler.java#L95]

The boolean return type of these methods is redundant, since they either fail or return true (they never return false).

The methods should throw an Exception instead of an AssertionError to indicate failure. This allows code higher up the stack to perform proper recovery and properly report the failure. At the moment, if an AssertionError is raised from these methods, dependent code (e.g., [CoreCliDriver|https://github.com/apache/hive/blob/6fbd54c0af60276d49b237defb550938c9c32610/itests/util/src/main/java/org/apache/hadoop/hive/cli/control/CoreCliDriver.java#L188]) fails to notice that the query has failed.

If loading/unloading fails, the environment (instance and class variables) is not properly cleaned up, leading to failures in all subsequent tests.
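The control-flow change proposed in HIVE-23946 can be sketched as follows: the loading method returns void and signals failure with a checked exception, and the caller catches it, cleans up its state and reports the failure. Everything here is hypothetical (DatasetLoadingSketch, DatasetException, the scriptSucceeds flag that simulates a failing load); these are not the real QTestDatasetHandler signatures.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; these are not the real QTestDatasetHandler signatures.
class DatasetLoadingSketch {

    static class DatasetException extends Exception {
        DatasetException(String message) {
            super(message);
        }
    }

    private final Set<String> loaded = new HashSet<>();

    // void instead of an always-true boolean; failure is a checked exception
    // rather than an AssertionError, so callers are forced to handle it.
    void initDataset(String name, boolean scriptSucceeds) throws DatasetException {
        if (!scriptSucceeds) {
            throw new DatasetException("Failed to load dataset " + name);
        }
        loaded.add(name);
    }

    // Caller: reports the failure and resets state so later tests start clean.
    String runTest(String dataset, boolean scriptSucceeds) {
        try {
            initDataset(dataset, scriptSucceeds);
            return "ok";
        } catch (DatasetException e) {
            loaded.clear();
            return "failed: " + e.getMessage();
        }
    }

    int loadedCount() {
        return loaded.size();
    }
}
```

Because the exception is checked, a caller cannot silently ignore a failed load the way it can ignore a boolean return value.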
[jira] [Created] (HIVE-23945) Fix TestCodahaleReportersConf
Zoltan Haindrich created HIVE-23945:
---------------------------------------

             Summary: Fix TestCodahaleReportersConf
                 Key: HIVE-23945
                 URL: https://issues.apache.org/jira/browse/HIVE-23945
             Project: Hive
          Issue Type: Bug
            Reporter: Zoltan Haindrich


http://ci.hive.apache.org/job/hive-precommit/job/master/138/testReport/junit/org.apache.hadoop.hive.common.metrics.metrics2/TestCodahaleReportersConf/Testing___split_11___Archive___testNoFallback/
[jira] [Created] (HIVE-23944) Stabilize TestStatsReplicationScenariosMMNoAutogather
Zoltan Haindrich created HIVE-23944:
---------------------------------------

             Summary: Stabilize TestStatsReplicationScenariosMMNoAutogather
                 Key: HIVE-23944
                 URL: https://issues.apache.org/jira/browse/HIVE-23944
             Project: Hive
          Issue Type: Bug
            Reporter: Zoltan Haindrich


http://ci.hive.apache.org/job/hive-precommit/job/master/130/testReport/junit/org.apache.hadoop.hive.ql.parse/TestStatsReplicationScenariosMMNoAutogather/Testing___split_16___Archive___testMetadataOnlyDump/

http://ci.hive.apache.org/job/hive-precommit/job/PR-1303/lastCompletedBuild/testReport/org.apache.hadoop.hive.ql.parse/TestStatsReplicationScenariosMigrationNoAutogather/Testing___split_09___Archive___testRetryFailure/