[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
[ https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=497097&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497097 ]

ASF GitHub Bot logged work on HIVE-23851:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 08/Oct/20 05:08
            Start Date: 08/Oct/20 05:08
    Worklog Time Spent: 10m

Work Description: shameersss1 commented on pull request #1271:
URL: https://github.com/apache/hive/pull/1271#issuecomment-705332460

@kgyrtkirk Thank you for taking the time to review this! Yes, +1 from my side for deprecating the Kryo stuff. The string-based approach is cool, but I am not sure how easy or difficult the changes will be. I will try to explore this from my side as well.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
    Worklog Id: (was: 497097)
    Time Spent: 5h  (was: 4h 50m)

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-23851
>                 URL: https://issues.apache.org/jira/browse/HIVE-23851
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 5h
>  Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create an external table
> # Run the msck command to sync all the partitions with the metastore
> # Remove one of the partition paths
> # Run msck repair with partition filtering
>
> *Stack Trace:*
> {code:java}
> 2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
> java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
> 	at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
> 	at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
> 	at org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96) [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52) [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48) [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593) [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> 	at org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80) [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_192]
> {code}
>
> *Cause:*
> In the case of msck repair with partition filtering, we expect the expression proxy class to be set to PartitionExpressionForMetastore (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78), while when dropping partitions we serialize the drop-partition filter expression as (https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589
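The cause described above is a producer/consumer format mismatch: one code path writes the partition filter expression in a format the configured expression proxy does not expect, so Kryo fails with an index error while decoding. The failure mode can be sketched generically in a standalone Java program; plain Java serialization stands in for Kryo here, and nothing below is Hive's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.nio.charset.StandardCharsets;

public class SerializerMismatchSketch {
    public static void main(String[] args) {
        // Producer writes the partition filter as raw string bytes...
        byte[] payload = "part_col = 'x'".getBytes(StandardCharsets.UTF_8);
        // ...but the consumer assumes an object-serialization format.
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            System.out.println("deserialized: " + in.readObject());
        } catch (Exception e) {
            // Decoding fails immediately, analogous to the Kryo
            // IndexOutOfBoundsException in the stack trace above.
            System.out.println("failed: " + e.getClass().getSimpleName());
        }
    }
}
```

The fix direction discussed in the review (a single string-based representation on both paths) removes this class of mismatch by construction.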
[jira] [Work logged] (HIVE-23811) deleteReader SARG rowId/bucketId are not getting validated properly
[ https://issues.apache.org/jira/browse/HIVE-23811?focusedWorklogId=497027=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497027 ] ASF GitHub Bot logged work on HIVE-23811: - Author: ASF GitHub Bot Created on: 08/Oct/20 00:44 Start Date: 08/Oct/20 00:44 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on pull request #1214: URL: https://github.com/apache/hive/pull/1214#issuecomment-705266436 This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the d...@hive.apache.org list if the patch is in need of reviews. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 497027) Time Spent: 0.5h (was: 20m) > deleteReader SARG rowId/bucketId are not getting validated properly > --- > > Key: HIVE-23811 > URL: https://issues.apache.org/jira/browse/HIVE-23811 > Project: Hive > Issue Type: Bug >Reporter: Naresh P R >Assignee: Naresh P R >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Though we are iterating over min/max stripeIndex, we always seem to pick > ColumnStats from first stripe > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L596] -- This message was sent by Atlassian Jira (v8.3.4#803005)
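The linked code iterates over the min/max stripe range but, per the report, keeps reading column stats from the first stripe. A hypothetical standalone illustration of that bug pattern (a constant index inside a loop) against the corrected variant, with made-up per-stripe minimums rather than real ORC stats:

```java
import java.util.List;

public class FirstStripeBugSketch {
    // Returns the min over stripes [from, to]; the buggy variant indexes the
    // stats list with a constant 0 instead of the loop variable.
    static int minOverStripes(List<Integer> stripeMins, int from, int to, boolean buggy) {
        int min = Integer.MAX_VALUE;
        for (int i = from; i <= to; i++) {
            int idx = buggy ? 0 : i; // the reported bug: loop index ignored
            min = Math.min(min, stripeMins.get(idx));
        }
        return min;
    }

    public static void main(String[] args) {
        List<Integer> stripeMins = List.of(50, 10, 30); // per-stripe column minimums (made up)
        System.out.println(minOverStripes(stripeMins, 0, 2, true) + " "
                + minOverStripes(stripeMins, 0, 2, false));
    }
}
```

The buggy variant reports the first stripe's value no matter which stripes are in range, which is why the SARG rowId/bucketId validation can pass or fail incorrectly.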
[jira] [Work logged] (HIVE-23955) Classification of Error Codes in Replication
[ https://issues.apache.org/jira/browse/HIVE-23955?focusedWorklogId=497025=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497025 ] ASF GitHub Bot logged work on HIVE-23955: - Author: ASF GitHub Bot Created on: 08/Oct/20 00:43 Start Date: 08/Oct/20 00:43 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on pull request #1358: URL: https://github.com/apache/hive/pull/1358#issuecomment-705266417 This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the d...@hive.apache.org list if the patch is in need of reviews. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 497025) Time Spent: 1h 50m (was: 1h 40m) > Classification of Error Codes in Replication > > > Key: HIVE-23955 > URL: https://issues.apache.org/jira/browse/HIVE-23955 > Project: Hive > Issue Type: Task >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23955.01.patch, HIVE-23955.02.patch, > HIVE-23955.03.patch, HIVE-23955.04.patch, Retry Logic for Replication.pdf > > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21611) Date.getTime() can be changed to System.currentTimeMillis()
[ https://issues.apache.org/jira/browse/HIVE-21611?focusedWorklogId=497026=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-497026 ] ASF GitHub Bot logged work on HIVE-21611: - Author: ASF GitHub Bot Created on: 08/Oct/20 00:43 Start Date: 08/Oct/20 00:43 Worklog Time Spent: 10m Work Description: github-actions[bot] closed pull request #595: URL: https://github.com/apache/hive/pull/595 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 497026) Time Spent: 2h 10m (was: 2h) > Date.getTime() can be changed to System.currentTimeMillis() > --- > > Key: HIVE-21611 > URL: https://issues.apache.org/jira/browse/HIVE-21611 > Project: Hive > Issue Type: Bug >Reporter: bd2019us >Assignee: Hunter Logan >Priority: Major > Labels: pull-request-available > Attachments: 1.patch > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Hello, > I found that System.currentTimeMillis() can be used here instead of new > Date.getTime(). > Since new Date() is a thin wrapper of light method > System.currentTimeMillis(). The performance will be greatly damaged if it is > invoked too much times. > According to my local testing at the same environment, > System.currentTimeMillis() can achieve a speedup to 5 times (435 ms vs 2073 > ms), when these two methods are invoked 5,000,000 times. -- This message was sent by Atlassian Jira (v8.3.4#803005)
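The two calls read the same clock; the difference is that `new Date().getTime()` allocates a Date object per call while `System.currentTimeMillis()` does not. A quick sketch of the equivalence (the 5x speedup quoted above is the reporter's local measurement, not reproduced here):

```java
public class TimeMillisSketch {
    public static void main(String[] args) {
        long viaDate = new java.util.Date().getTime(); // allocates a Date per call
        long direct = System.currentTimeMillis();      // no allocation
        // Both return epoch milliseconds; they differ only by the instants
        // elapsed between the two calls.
        System.out.println(Math.abs(direct - viaDate) < 60_000);
    }
}
```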
[jira] [Updated] (HIVE-24232) Incorrect translation of rollup expression from Calcite
[ https://issues.apache.org/jira/browse/HIVE-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesus Camacho Rodriguez updated HIVE-24232: --- Fix Version/s: 4.0.0 Resolution: Fixed Status: Resolved (was: Patch Available) > Incorrect translation of rollup expression from Calcite > --- > > Key: HIVE-24232 > URL: https://issues.apache.org/jira/browse/HIVE-24232 > Project: Hive > Issue Type: Bug > Components: CBO >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > In Calcite, it is not necessary that the columns in the group set are in the > same order as the rollup. For instance, this is the Calcite representation of > a rollup for a given query: > {code} > HiveAggregate(group=[{1, 6, 7}], groups=[[{1, 6, 7}, {1, 7}, {1}, {}]], > agg#0=[sum($12)], agg#1=[count($12)], agg#2=[sum($4)], agg#3=[count($4)], > agg#4=[sum($15)], agg#5=[count($15)]) > {code} > When we generate the Hive plan from the Calcite operator, we make such > assumption incorrectly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
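The Calcite plan above makes the ordering point concrete: the group sets `[{1,6,7}, {1,7}, {1}, {}]` are the prefixes of the rollup order (1, 7, 6), which is not the ascending order of the full group set {1, 6, 7}. A small illustrative sketch of how rollup group sets derive from the rollup order (not Hive's or Calcite's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class RollupGroupSetsSketch {
    // Group sets of ROLLUP(c1, ..., cn) are the prefixes of the rollup order.
    static List<List<Integer>> rollupGroups(List<Integer> rollupOrder) {
        List<List<Integer>> groups = new ArrayList<>();
        for (int n = rollupOrder.size(); n >= 0; n--) {
            groups.add(rollupOrder.subList(0, n));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Rollup order (1, 7, 6) reproduces the plan's groups [{1,6,7}, {1,7}, {1}, {}]
        System.out.println(rollupGroups(List.of(1, 7, 6)));
    }
}
```

Recovering the rollup order therefore requires inspecting which column each successive group set drops, rather than assuming the columns appear in group-set order.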
[jira] [Work logged] (HIVE-24232) Incorrect translation of rollup expression from Calcite
[ https://issues.apache.org/jira/browse/HIVE-24232?focusedWorklogId=496961=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496961 ] ASF GitHub Bot logged work on HIVE-24232: - Author: ASF GitHub Bot Created on: 07/Oct/20 22:03 Start Date: 07/Oct/20 22:03 Worklog Time Spent: 10m Work Description: jcamachor merged pull request #1554: URL: https://github.com/apache/hive/pull/1554 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496961) Time Spent: 0.5h (was: 20m) > Incorrect translation of rollup expression from Calcite > --- > > Key: HIVE-24232 > URL: https://issues.apache.org/jira/browse/HIVE-24232 > Project: Hive > Issue Type: Bug > Components: CBO >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In Calcite, it is not necessary that the columns in the group set are in the > same order as the rollup. For instance, this is the Calcite representation of > a rollup for a given query: > {code} > HiveAggregate(group=[{1, 6, 7}], groups=[[{1, 6, 7}, {1, 7}, {1}, {}]], > agg#0=[sum($12)], agg#1=[count($12)], agg#2=[sum($4)], agg#3=[count($4)], > agg#4=[sum($15)], agg#5=[count($15)]) > {code} > When we generate the Hive plan from the Calcite operator, we make such > assumption incorrectly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496802=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496802 ] ASF GitHub Bot logged work on HIVE-24203: - Author: ASF GitHub Bot Created on: 07/Oct/20 17:55 Start Date: 07/Oct/20 17:55 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1531: URL: https://github.com/apache/hive/pull/1531#discussion_r501203341 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java ## @@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, } } + /** + * LateralViewJoinOperator changes the data size and column level statistics. + * + * A diagram of LATERAL VIEW. + * + * [Lateral View Forward] + * / \ + *[Select] [Select] + *|| + *| [UDTF] + *\ / + * [Lateral View Join] + * + * For each row of the source, the left branch just picks columns and the right branch processes UDTF. + * And then LVJ joins a row from the left branch with rows from the right branch. + * The join has one-to-many relationship since UDTF can generate multiple rows. + * + * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides. + */ + public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor { +@Override +public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, + Object... 
nodeOutputs) throws SemanticException { + final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd; + final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx; + final HiveConf conf = aspCtx.getConf(); + + if (!isAllParentsContainStatistics(lop)) { +return null; + } + + final List> parents = lop.getParentOperators(); + if (parents.size() != 2) { +LOG.warn("LateralViewJoinOperator should have just two parents but actually has " ++ parents.size() + " parents."); +return null; + } + + final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics(); + final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics(); + + final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows(); + final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor); + final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize()); + Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0); + + if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) { +final Map columnExprMap = lop.getColumnExprMap(); +final RowSchema schema = lop.getSchema(); + +joinedStats.updateColumnStatsState(selectStats.getColumnStatsState()); +final List selectColStats = StatsUtils +.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema); +joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor)); + +joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState()); +final List udtfColStats = StatsUtils +.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema); +joinedStats.addToColumnStats(udtfColStats); + +joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop); +lop.setStatistics(joinedStats); + +if (LOG.isDebugEnabled()) { + LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString()); +} Review comment: I don't know what's the point of 
these `[0]`/`[1]` markers; from one of the historical commits it seems to me like these are some kind of "log message indexes" inside the method I think we could stop doing that... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496802) Time Spent: 2h (was: 1h 50m) > Implement stats annotation rule for the LateralViewJoinOperator > --- > > Key: HIVE-24203 > URL: https://issues.apache.org/jira/browse/HIVE-24203 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer >Affects
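Numerically, the rule in this patch scales the select branch's data size by the row-count ratio T(udtf) / T(select) and sums both sides, matching the one-to-many relationship of the lateral view join. A standalone sketch with made-up branch statistics (100 rows / 10000 bytes on the select side, 300 rows / 60000 bytes on the UDTF side):

```java
public class LateralViewStatsSketch {
    public static void main(String[] args) {
        // Made-up branch statistics: T(select) and T(udtf)
        long selectRows = 100, udtfRows = 300;
        long selectDataSize = 10_000, udtfDataSize = 60_000;

        // Scale the select branch by T(udtf) / T(select), then sum both sides
        double factor = (double) udtfRows / (double) selectRows;
        long scaledSelectSize = (long) (selectDataSize * factor);
        long joinedDataSize = scaledSelectSize + udtfDataSize;

        System.out.println(factor + " " + joinedDataSize);
    }
}
```

The patch additionally uses overflow-safe arithmetic (StatsUtils.safeMult / safeAdd) for the same computation.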
[jira] [Commented] (HIVE-22344) I can't run hive in command line
[ https://issues.apache.org/jira/browse/HIVE-22344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209714#comment-17209714 ] Amit Singh commented on HIVE-22344: --- Replace the Hive jar under /lib with the one shipped with Hadoop. > I can't run hive in command line > > > Key: HIVE-22344 > URL: https://issues.apache.org/jira/browse/HIVE-22344 > Project: Hive > Issue Type: Bug > Components: CLI >Affects Versions: 3.1.2 > Environment: hive: 3.1.2 > hadoop 3.2.1 > >Reporter: Smith Cruise >Priority: Blocker > > I can't run hive in command. It tell me : > {code:java} > [hadoop@master lib]$ hive > which: no hbase in > (/home/hadoop/apache-hive-3.1.2-bin/bin:{{pwd}}/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/home/hadoop/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/hadoop/hadoop3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. 
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] > Exception in thread "main" java.lang.NoSuchMethodError: > com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V > at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357) > at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338) > at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:536) > at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:554) > at org.apache.hadoop.mapred.JobConf.(JobConf.java:448) > at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5141) > at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:5099) > at > org.apache.hadoop.hive.common.LogUtils.initHiveLog4jCommon(LogUtils.java:97) > at > org.apache.hadoop.hive.common.LogUtils.initHiveLog4j(LogUtils.java:81) > at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699) > at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:323) > at org.apache.hadoop.util.RunJar.main(RunJar.java:236) > {code} > I don't know what's wrong about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
The NoSuchMethodError on com.google.common.base.Preconditions.checkArgument is the common symptom of two incompatible Guava versions on the classpath (Hive 3.1.2 bundles an older Guava than Hadoop 3.2.1); aligning the Guava jar under Hive's lib directory with the one Hadoop ships, as suggested in the comment above, is the usual remedy.
[jira] [Resolved] (HIVE-24199) Incorrect result when subquery in exists contains limit
[ https://issues.apache.org/jira/browse/HIVE-24199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Kasa resolved HIVE-24199. --- Resolution: Fixed Pushed to master, thanks [~vgarg] for review. > Incorrect result when subquey in exists contains limit > -- > > Key: HIVE-24199 > URL: https://issues.apache.org/jira/browse/HIVE-24199 > Project: Hive > Issue Type: Bug >Reporter: Krisztian Kasa >Assignee: Krisztian Kasa >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > {code:java} > create table web_sales (ws_order_number int, ws_warehouse_sk int) stored as > orc; > insert into web_sales values > (1, 1), > (1, 2), > (2, 1), > (2, 2); > select * from web_sales ws1 > where exists (select 1 from web_sales ws2 where ws1.ws_order_number = > ws2.ws_order_number limit 1); > 1 1 > 1 2 > {code} > {code:java} > CBO PLAN: > HiveSemiJoin(condition=[=($0, $2)], joinType=[semi]) > HiveProject(ws_order_number=[$0], ws_warehouse_sk=[$1]) > HiveFilter(condition=[IS NOT NULL($0)]) > HiveTableScan(table=[[default, web_sales]], table:alias=[ws1]) > HiveProject(ws_order_number=[$0]) > HiveSortLimit(fetch=[1]) <-- This shouldn't be added > HiveProject(ws_order_number=[$0]) > HiveFilter(condition=[IS NOT NULL($0)]) > HiveTableScan(table=[[default, web_sales]], table:alias=[ws2]) > {code} > Limit n on the right side of the join reduces the result set coming from the > right to only n record hence not all the ws_order_number values are included > which leads to correctness issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
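The correctness issue above can be reproduced outside SQL: applying LIMIT 1 to the inner side before the semi-join drops order numbers, so some outer rows lose their match. A hypothetical in-memory semi-join over the same four web_sales rows (a Java stream sketch, not Hive's execution path):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

public class ExistsLimitSketch {
    public static void main(String[] args) {
        // web_sales rows: {ws_order_number, ws_warehouse_sk}
        int[][] webSales = {{1, 1}, {1, 2}, {2, 1}, {2, 2}};

        // Correct EXISTS: the inner side keeps every distinct ws_order_number
        Set<Integer> inner = Arrays.stream(webSales).map(r -> r[0]).collect(Collectors.toSet());
        long correct = Arrays.stream(webSales).filter(r -> inner.contains(r[0])).count();

        // Buggy plan: LIMIT 1 on the inner side keeps only one order number
        Set<Integer> limited = Set.of(webSales[0][0]);
        long buggy = Arrays.stream(webSales).filter(r -> limited.contains(r[0])).count();

        System.out.println(correct + " " + buggy); // all 4 rows vs. only 2
    }
}
```

This is why the HiveSortLimit under the semi-join in the CBO plan must not be generated: LIMIT inside EXISTS is semantically a no-op (any match suffices), but limiting the join input changes the result.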
[jira] [Work logged] (HIVE-24199) Incorrect result when subquery in exists contains limit
[ https://issues.apache.org/jira/browse/HIVE-24199?focusedWorklogId=496765=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496765 ] ASF GitHub Bot logged work on HIVE-24199: - Author: ASF GitHub Bot Created on: 07/Oct/20 17:15 Start Date: 07/Oct/20 17:15 Worklog Time Spent: 10m Work Description: kasakrisz merged pull request #1525: URL: https://github.com/apache/hive/pull/1525 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496765) Time Spent: 50m (was: 40m) > Incorrect result when subquey in exists contains limit > -- > > Key: HIVE-24199 > URL: https://issues.apache.org/jira/browse/HIVE-24199 > Project: Hive > Issue Type: Bug >Reporter: Krisztian Kasa >Assignee: Krisztian Kasa >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > {code:java} > create table web_sales (ws_order_number int, ws_warehouse_sk int) stored as > orc; > insert into web_sales values > (1, 1), > (1, 2), > (2, 1), > (2, 2); > select * from web_sales ws1 > where exists (select 1 from web_sales ws2 where ws1.ws_order_number = > ws2.ws_order_number limit 1); > 1 1 > 1 2 > {code} > {code:java} > CBO PLAN: > HiveSemiJoin(condition=[=($0, $2)], joinType=[semi]) > HiveProject(ws_order_number=[$0], ws_warehouse_sk=[$1]) > HiveFilter(condition=[IS NOT NULL($0)]) > HiveTableScan(table=[[default, web_sales]], table:alias=[ws1]) > HiveProject(ws_order_number=[$0]) > HiveSortLimit(fetch=[1]) <-- This shouldn't be added > HiveProject(ws_order_number=[$0]) > HiveFilter(condition=[IS NOT NULL($0)]) > HiveTableScan(table=[[default, web_sales]], table:alias=[ws2]) > {code} > Limit n on the right side of the join reduces the result set coming from the > right to only n record hence not all the 
ws_order_number values are included > which leads to correctness issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24040) Slightly odd behaviour with CHAR comparisons and string literals
[ https://issues.apache.org/jira/browse/HIVE-24040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209672#comment-17209672 ] Tim Armstrong commented on HIVE-24040: -- [~kgyrtkirk] I'd recommend reading http://databasearchitects.blogspot.com/2015/01/fun-with-char.html for an interesting perspective on this (one of its conclusions is that Postgres and other systems do not implement the spec exactly, and that may be a good thing). > Slightly odd behaviour with CHAR comparisons and string literals > > > Key: HIVE-24040 > URL: https://issues.apache.org/jira/browse/HIVE-24040 > Project: Hive > Issue Type: Bug >Reporter: Tim Armstrong >Priority: Major > > If t is a char column, this statement behaves a bit strangely - since the RHS > is a STRING, I would have expected it to behave consistently with other > CHAR/STRING comparisons, where the CHAR column has its trailing spaces > removed and the STRING does not have its trailing spaces removed. > {noformat} > select count(*) from ax where t = cast('a ' as string); > {noformat} > Instead it seems to be treated the same as if it was a plain literal, > interpreted as CHAR, i.e. > {noformat} > select count(*) from ax where t = 'a '; > {noformat} > Here are some more experiments I did based on > https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/in_typecheck_char.q > that seem to show some inconsistencies. 
> {noformat} > -- Hive version 3.1.3000.7.2.1.0-287 r4e72e59f1c2a51a64e0ff37b14bd396cd4e97b98 > create table ax(s char(1),t char(10)); > insert into ax values ('a','a'),('a','a '),('b','bb'); > -- varchar literal preserves trailing space > select count(*) from ax where t = cast('a ' as varchar(50)); > +--+ > | _c0 | > +--+ > | 0| > +--+ > -- explicit cast of literal to string removes trailing space > select count(*) from ax where t = cast('a ' as string); > +--+ > | _c0 | > +--+ > | 2| > +--+ > -- other string expressions preserve trailing space > select count(*) from ax where t = concat('a', ' '); > +--+ > | _c0 | > +--+ > | 0| > +--+ > -- varchar col preserves trailing space > create table stringv as select cast('a ' as varchar(50)); > select count(*) from ax, stringv where t = `_c0`; > +--+ > | _c0 | > +--+ > | 0| > +--+ > -- string col preserves trailing space > create table stringa as select 'a '; > select count(*) from ax, stringa where t = `_c0`; > +--+ > | _c0 | > +--+ > | 0| > +--+ > {noformat} > [~jcamachorodriguez] [~kgyrtkirk] -- This message was sent by Atlassian Jira (v8.3.4#803005)
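One way to see the inconsistency in the experiments above: Hive trims trailing spaces on the CHAR side of a comparison while a STRING/VARCHAR comparand keeps its trailing spaces, yet the explicit cast-to-string literal behaves as if it were folded back to a CHAR. A toy model of the two comparison rules (illustrative only, not Hive's actual type-coercion code):

```java
public class CharPadSketch {
    // Models CHAR semantics: trailing blanks are insignificant.
    static String rtrim(String s) {
        return s.replaceAll(" +$", "");
    }

    public static void main(String[] args) {
        String charVal = "a         "; // CHAR(10) value 'a', blank padded
        // CHAR vs CHAR: both sides lose trailing spaces -> equal
        boolean charVsChar = rtrim(charVal).equals(rtrim("a "));
        // CHAR vs STRING: only the CHAR side is trimmed -> 'a' != 'a '
        boolean charVsString = rtrim(charVal).equals("a ");
        System.out.println(charVsChar + " " + charVsString);
    }
}
```

Under these rules `t = cast('a ' as string)` should return 0 rows like the varchar and concat cases, which is why the observed count of 2 looks like the literal was treated as a CHAR.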
[jira] [Resolved] (HIVE-24229) DirectSql fails in case of OracleDB
[ https://issues.apache.org/jira/browse/HIVE-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Haindrich resolved HIVE-24229. - Fix Version/s: 4.0.0 Resolution: Fixed Merged into master. Thank you [~ayushtkn]! > DirectSql fails in case of OracleDB > --- > > Key: HIVE-24229 > URL: https://issues.apache.org/jira/browse/HIVE-24229 > Project: Hive > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Direct Sql fails due to different data type mapping incase of Oracle DB -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24229) DirectSql fails in case of OracleDB
[ https://issues.apache.org/jira/browse/HIVE-24229?focusedWorklogId=496733=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496733 ] ASF GitHub Bot logged work on HIVE-24229: - Author: ASF GitHub Bot Created on: 07/Oct/20 16:32 Start Date: 07/Oct/20 16:32 Worklog Time Spent: 10m Work Description: kgyrtkirk merged pull request #1552: URL: https://github.com/apache/hive/pull/1552 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496733) Time Spent: 40m (was: 0.5h) > DirectSql fails in case of OracleDB > --- > > Key: HIVE-24229 > URL: https://issues.apache.org/jira/browse/HIVE-24229 > Project: Hive > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Direct Sql fails due to different data type mapping incase of Oracle DB -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=496724=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496724 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 07/Oct/20 16:17 Start Date: 07/Oct/20 16:17 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1415: URL: https://github.com/apache/hive/pull/1415#discussion_r501141180 ## File path: standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/txn/CompactionTxnHandler.java ## @@ -386,15 +427,27 @@ public void markCleaned(CompactionInfo info) throws MetaException { pStmt.setLong(paramCount++, info.highestWriteId); } LOG.debug("Going to execute update <" + s + ">"); -if (pStmt.executeUpdate() < 1) { - LOG.error("Expected to remove at least one row from completed_txn_components when " + -"marking compaction entry as clean!"); +if ((updCount = pStmt.executeUpdate()) < 1) { + // In the case of clean abort commit hasn't happened so completed_txn_components hasn't been filled + if (!info.isCleanAbortedCompaction()) { +LOG.error( +"Expected to remove at least one row from completed_txn_components when " ++ "marking compaction entry as clean!"); + } } s = "select distinct txn_id from TXNS, TXN_COMPONENTS where txn_id = tc_txnid and txn_state = '" + TXN_ABORTED + "' and tc_database = ? and tc_table = ?"; if (info.highestWriteId != 0) s += " and tc_writeid <= ?"; if (info.partName != null) s += " and tc_partition = ?"; +if (info.writeIds != null && info.writeIds.size() > 0) { + String[] wriStr = new String[info.writeIds.size()]; + int i = 0; + for (Long writeId: writeIds) { +wriStr[i++] = writeId.toString(); + } + s += " and tc_writeid in (" + String.join(",", wriStr) + ")"; Review comment: is this even used, statement was already compiled? This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496724) Time Spent: 6.5h (was: 6h 20m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 6.5h > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
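The review question above is about ordering: `pStmt` was prepared from an earlier value of `s`, so appending the `tc_writeid` IN-list to the string afterwards cannot affect that already-compiled statement — only a statement prepared from the new string would see it. A trivial sketch of the pitfall, with plain strings standing in for JDBC:

```java
public class PreparedAfterAppendSketch {
    public static void main(String[] args) {
        String s = "select distinct txn_id from TXNS where tc_database = ?";
        // Stand-in for conn.prepareStatement(s): the statement captures s NOW
        String compiledSql = s;
        // Appending afterwards changes only the local string, not the statement
        s += " and tc_writeid in (1,2,3)";
        System.out.println(compiledSql.contains("tc_writeid"));
    }
}
```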
[jira] [Work logged] (HIVE-24236) Connection leak in TxnHandler
[ https://issues.apache.org/jira/browse/HIVE-24236?focusedWorklogId=496693=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496693 ] ASF GitHub Bot logged work on HIVE-24236: - Author: ASF GitHub Bot Created on: 07/Oct/20 15:43 Start Date: 07/Oct/20 15:43 Worklog Time Spent: 10m Work Description: yongzhi commented on pull request #1559: URL: https://github.com/apache/hive/pull/1559#issuecomment-705023401 recheck This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496693) Time Spent: 1h 10m (was: 1h) > Connection leak in TxnHandler > - > > Key: HIVE-24236 > URL: https://issues.apache.org/jira/browse/HIVE-24236 > Project: Hive > Issue Type: Bug > Components: Metastore >Reporter: Yongzhi Chen >Assignee: Yongzhi Chen >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > We see failures in QE tests with cannot allocate connections errors. 
The > exception stack like following: > {noformat} > 2020-09-29T18:44:26,563 INFO [Heartbeater-0]: txn.TxnHandler > (TxnHandler.java:checkRetryable(3733)) - Non-retryable error in > heartbeat(HeartbeatRequest(lockid:0, txnid:11908)) : Cannot get a connection, > general error (SQLState=null, ErrorCode=0) > 2020-09-29T18:44:26,564 ERROR [Heartbeater-0]: metastore.RetryingHMSHandler > (RetryingHMSHandler.java:invokeInternal(201)) - MetaException(message:Unable > to select from transaction database > org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, general > error > at > org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:118) > at > org.apache.hadoop.hive.metastore.txn.TxnHandler.getDbConn(TxnHandler.java:3605) > at > org.apache.hadoop.hive.metastore.txn.TxnHandler.getDbConn(TxnHandler.java:3598) > at > org.apache.hadoop.hive.metastore.txn.TxnHandler.heartbeat(TxnHandler.java:2739) > at > org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.heartbeat(HiveMetaStore.java:8452) > at sun.reflect.GeneratedMethodAccessor415.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108) > at com.sun.proxy.$Proxy63.heartbeat(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.heartbeat(HiveMetaStoreClient.java:3247) > at sun.reflect.GeneratedMethodAccessor414.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:213) > at com.sun.proxy.$Proxy64.heartbeat(Unknown Source) > at > 
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.heartbeat(DbTxnManager.java:671) > at > org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.lambda$run$0(DbTxnManager.java:1102) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at > org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.run(DbTxnManager.java:1101) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at >
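The stack trace above ends in an exhausted pool ("Cannot get a connection"), the classic symptom of connections that are checked out but never returned on an exception path. A toy illustration of the leak-proof pattern, using hypothetical stand-in types rather than Hive's `TxnHandler`: try-with-resources returns the connection on every exit path, including the failure path seen in the heartbeat thread.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of a connection pool: CHECKED_OUT counts connections not yet returned.
public class PoolDemo {
    static final AtomicInteger CHECKED_OUT = new AtomicInteger();

    static class PooledConnection implements AutoCloseable {
        PooledConnection() { CHECKED_OUT.incrementAndGet(); }
        void heartbeat() { throw new RuntimeException("simulated heartbeat failure"); }
        @Override public void close() { CHECKED_OUT.decrementAndGet(); }
    }

    // try-with-resources guarantees close() runs even when heartbeat() throws,
    // so the pool never loses a connection to an exception.
    static void safeHeartbeat() {
        try (PooledConnection conn = new PooledConnection()) {
            conn.heartbeat();
        } catch (RuntimeException e) {
            // swallowed for the demo; real code would retry or rethrow
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            safeHeartbeat();
        }
        System.out.println("leaked connections: " + CHECKED_OUT.get()); // prints 0
    }
}
```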
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496688=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496688 ] ASF GitHub Bot logged work on HIVE-24203: - Author: ASF GitHub Bot Created on: 07/Oct/20 15:35 Start Date: 07/Oct/20 15:35 Worklog Time Spent: 10m Work Description: okumin commented on a change in pull request #1531: URL: https://github.com/apache/hive/pull/1531#discussion_r501110922 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java ## @@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, } } + /** + * LateralViewJoinOperator changes the data size and column level statistics. + * + * A diagram of LATERAL VIEW. + * + * [Lateral View Forward] + * / \ + *[Select] [Select] + *|| + *| [UDTF] + *\ / + * [Lateral View Join] + * + * For each row of the source, the left branch just picks columns and the right branch processes UDTF. + * And then LVJ joins a row from the left branch with rows from the right branch. + * The join has one-to-many relationship since UDTF can generate multiple rows. + * + * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides. + */ + public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor { +@Override +public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, + Object... 
nodeOutputs) throws SemanticException { + final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd; + final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx; + final HiveConf conf = aspCtx.getConf(); + + if (!isAllParentsContainStatistics(lop)) { +return null; + } + + final List> parents = lop.getParentOperators(); + if (parents.size() != 2) { +LOG.warn("LateralViewJoinOperator should have just two parents but actually has " ++ parents.size() + " parents."); +return null; + } + + final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics(); + final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics(); + + final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows(); + final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor); + final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize()); + Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0); + + if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) { +final Map columnExprMap = lop.getColumnExprMap(); +final RowSchema schema = lop.getSchema(); + +joinedStats.updateColumnStatsState(selectStats.getColumnStatsState()); +final List selectColStats = StatsUtils +.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema); +joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor)); + +joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState()); +final List udtfColStats = StatsUtils +.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema); +joinedStats.addToColumnStats(udtfColStats); + +joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop); +lop.setStatistics(joinedStats); + +if (LOG.isDebugEnabled()) { + LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString()); +} + } else { +joinedStats = 
applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop); +lop.setStatistics(joinedStats); + +if (LOG.isDebugEnabled()) { + LOG.debug("[1] STATS-" + lop.toString() + ": " + joinedStats.extendedToString()); +} + } + return null; +} + +private List multiplyColStats(List colStatistics, double factor) { + for (ColStatistics colStats : colStatistics) { +colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), factor)); +colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), factor)); +colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), factor)); +// When factor > 1, the same records are duplicated and countDistinct never changes. +if (factor < 1.0) { + colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), factor)); Review comment: This method may include additional logging and logics to optimize JOIN such as
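The rule quoted in this review thread scales the SELECT branch by `factor = T(udtf) / T(select)` and sums the two branches. A simplified, self-contained sketch of that arithmetic, including the divide-by-zero guard a reviewer asks for elsewhere in the thread (`safeMult`/`safeAdd` imitate Hive's overflow-safe `StatsUtils` helpers but are reimplemented here):

```java
// Simplified sketch of the LateralViewJoinStatsRule data-size arithmetic.
public class LvjStatsSketch {
    // Saturating multiply of a long by a double factor (non-negative inputs assumed).
    static long safeMult(long a, double b) {
        double r = a * b;
        return r > Long.MAX_VALUE ? Long.MAX_VALUE : (long) r;
    }

    // Saturating add; the sign trick detects two's-complement overflow.
    static long safeAdd(long a, long b) {
        long r = a + b;
        return ((a ^ r) & (b ^ r)) < 0 ? Long.MAX_VALUE : r;
    }

    static long joinedDataSize(long selectRows, long selectSize,
                               long udtfRows, long udtfSize) {
        if (selectRows == 0) {
            return udtfSize; // guard against division by zero, "just in case"
        }
        double factor = (double) udtfRows / (double) selectRows;
        return safeAdd(safeMult(selectSize, factor), udtfSize);
    }

    public static void main(String[] args) {
        // 10 source rows totalling 1000 bytes; the UDTF emits 30 rows totalling 2400 bytes.
        // factor = 3.0, so joined size = 1000 * 3 + 2400 = 5400.
        System.out.println(joinedDataSize(10, 1000, 30, 2400)); // 5400
    }
}
```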
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496681=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496681 ] ASF GitHub Bot logged work on HIVE-24203: - Author: ASF GitHub Bot Created on: 07/Oct/20 15:23 Start Date: 07/Oct/20 15:23 Worklog Time Spent: 10m Work Description: okumin commented on a change in pull request #1531: URL: https://github.com/apache/hive/pull/1531#discussion_r501101794 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java ## @@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, } } + /** + * LateralViewJoinOperator changes the data size and column level statistics. + * + * A diagram of LATERAL VIEW. + * + * [Lateral View Forward] + * / \ + *[Select] [Select] + *|| + *| [UDTF] + *\ / + * [Lateral View Join] + * + * For each row of the source, the left branch just picks columns and the right branch processes UDTF. + * And then LVJ joins a row from the left branch with rows from the right branch. + * The join has one-to-many relationship since UDTF can generate multiple rows. + * + * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides. + */ + public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor { +@Override +public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, + Object... 
nodeOutputs) throws SemanticException { + final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd; + final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx; + final HiveConf conf = aspCtx.getConf(); + + if (!isAllParentsContainStatistics(lop)) { +return null; + } + + final List> parents = lop.getParentOperators(); + if (parents.size() != 2) { +LOG.warn("LateralViewJoinOperator should have just two parents but actually has " ++ parents.size() + " parents."); +return null; + } + + final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics(); + final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics(); + + final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows(); + final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor); + final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize()); + Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0); + + if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) { +final Map columnExprMap = lop.getColumnExprMap(); +final RowSchema schema = lop.getSchema(); + +joinedStats.updateColumnStatsState(selectStats.getColumnStatsState()); +final List selectColStats = StatsUtils +.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema); +joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor)); + +joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState()); +final List udtfColStats = StatsUtils +.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema); +joinedStats.addToColumnStats(udtfColStats); + +joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop); +lop.setStatistics(joinedStats); + +if (LOG.isDebugEnabled()) { + LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString()); +} Review comment: I wonder if we should switch `[0]` or 
`[1]` based on a condition. I can see some rules use a different marker based on maybe the existence of column stats. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496681) Time Spent: 1h 40m (was: 1.5h) > Implement stats annotation rule for the LateralViewJoinOperator > --- > > Key: HIVE-24203 > URL: https://issues.apache.org/jira/browse/HIVE-24203 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer >Affects Versions: 4.0.0, 3.1.2, 2.3.7 >Reporter: okumin >
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496677=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496677 ] ASF GitHub Bot logged work on HIVE-24203: - Author: ASF GitHub Bot Created on: 07/Oct/20 15:17 Start Date: 07/Oct/20 15:17 Worklog Time Spent: 10m Work Description: okumin commented on a change in pull request #1531: URL: https://github.com/apache/hive/pull/1531#discussion_r501096999 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java ## @@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, } } + /** + * LateralViewJoinOperator changes the data size and column level statistics. + * + * A diagram of LATERAL VIEW. + * + * [Lateral View Forward] + * / \ + *[Select] [Select] + *|| + *| [UDTF] + *\ / + * [Lateral View Join] + * + * For each row of the source, the left branch just picks columns and the right branch processes UDTF. + * And then LVJ joins a row from the left branch with rows from the right branch. + * The join has one-to-many relationship since UDTF can generate multiple rows. + * + * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides. + */ + public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor { +@Override +public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, + Object... 
nodeOutputs) throws SemanticException { + final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd; + final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx; + final HiveConf conf = aspCtx.getConf(); + + if (!isAllParentsContainStatistics(lop)) { +return null; + } + + final List> parents = lop.getParentOperators(); + if (parents.size() != 2) { +LOG.warn("LateralViewJoinOperator should have just two parents but actually has " ++ parents.size() + " parents."); +return null; + } + + final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics(); + final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics(); + + final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows(); + final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor); + final long dataSize = StatsUtils.safeAdd(selectDataSize, udtfStats.getDataSize()); + Statistics joinedStats = new Statistics(udtfStats.getNumRows(), dataSize, 0, 0); + + if (satisfyPrecondition(selectStats) && satisfyPrecondition(udtfStats)) { +final Map columnExprMap = lop.getColumnExprMap(); +final RowSchema schema = lop.getSchema(); + +joinedStats.updateColumnStatsState(selectStats.getColumnStatsState()); +final List selectColStats = StatsUtils +.getColStatisticsFromExprMap(conf, selectStats, columnExprMap, schema); +joinedStats.addToColumnStats(multiplyColStats(selectColStats, factor)); + +joinedStats.updateColumnStatsState(udtfStats.getColumnStatsState()); +final List udtfColStats = StatsUtils +.getColStatisticsFromExprMap(conf, udtfStats, columnExprMap, schema); +joinedStats.addToColumnStats(udtfColStats); + +joinedStats = applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop); +lop.setStatistics(joinedStats); + +if (LOG.isDebugEnabled()) { + LOG.debug("[0] STATS-" + lop.toString() + ": " + joinedStats.extendedToString()); +} + } else { +joinedStats = 
applyRuntimeStats(aspCtx.getParseContext().getContext(), joinedStats, lop); +lop.setStatistics(joinedStats); + +if (LOG.isDebugEnabled()) { + LOG.debug("[1] STATS-" + lop.toString() + ": " + joinedStats.extendedToString()); +} + } + return null; +} + +private List multiplyColStats(List colStatistics, double factor) { + for (ColStatistics colStats : colStatistics) { +colStats.setNumFalses(StatsUtils.safeMult(colStats.getNumFalses(), factor)); +colStats.setNumTrues(StatsUtils.safeMult(colStats.getNumTrues(), factor)); +colStats.setNumNulls(StatsUtils.safeMult(colStats.getNumNulls(), factor)); +// When factor > 1, the same records are duplicated and countDistinct never changes. +if (factor < 1.0) { + colStats.setCountDistint(StatsUtils.safeMult(colStats.getCountDistint(), factor)); Review comment: Now I think this is available for this purpose if we add updating num trues and
[jira] [Commented] (HIVE-24040) Slightly odd behaviour with CHAR comparisons and string literals
[ https://issues.apache.org/jira/browse/HIVE-24040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209610#comment-17209610 ] Zoltan Haindrich commented on HIVE-24040: - {code} select cast('a' as char(10)) = cast('a ' as varchar(50)) {code} in psql I got some interesting results: {code} select length(cast('a ' as varchar(10))),length(cast('a ' as char(10) ) ),cast('a ' as varchar(10))=cast('a ' as char(10) ); length | length | ?column? ++-- 2 | 1 | t {code} in Hive for the above case the comparison should happen in "string", for which the lengths are different => will not match {code} select length(cast(cast('a' as char(10)) as string)),length(cast(cast('a ' as varchar(50)) as string)) +--+--+ | _c0 | _c1 | +--+--+ | 1| 2| +--+--+ {code} I feel that this is somewhere in the gray zone...will dig into the sql specs... > Slightly odd behaviour with CHAR comparisons and string literals > > > Key: HIVE-24040 > URL: https://issues.apache.org/jira/browse/HIVE-24040 > Project: Hive > Issue Type: Bug >Reporter: Tim Armstrong >Priority: Major > > If t is a char column, this statement behaves a bit strangely - since the RHS > is a STRING, I would have expected it to behave consistently with other > CHAR/STRING comparisons, where the CHAR column has its trailing spaces > removed and the STRING does not have its trailing spaces removed. > {noformat} > select count(*) from ax where t = cast('a ' as string); > {noformat} > Instead it seems to be treated the same as if it was a plain literal, > interpreted as CHAR, i.e. > {noformat} > select count(*) from ax where t = 'a '; > {noformat} > Here are some more experiments I did based on > https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/in_typecheck_char.q > that seem to show some inconsistencies.
> {noformat} > -- Hive version 3.1.3000.7.2.1.0-287 r4e72e59f1c2a51a64e0ff37b14bd396cd4e97b98 > create table ax(s char(1),t char(10)); > insert into ax values ('a','a'),('a','a '),('b','bb'); > -- varchar literal preserves trailing space > select count(*) from ax where t = cast('a ' as varchar(50)); > +--+ > | _c0 | > +--+ > | 0| > +--+ > -- explicit cast of literal to string removes trailing space > select count(*) from ax where t = cast('a ' as string); > +--+ > | _c0 | > +--+ > | 2| > +--+ > -- other string expressions preserve trailing space > select count(*) from ax where t = concat('a', ' '); > +--+ > | _c0 | > +--+ > | 0| > +--+ > -- varchar col preserves trailing space > create table stringv as select cast('a ' as varchar(50)); > select count(*) from ax, stringv where t = `_c0`; > +--+ > | _c0 | > +--+ > | 0| > +--+ > -- string col preserves trailing space > create table stringa as select 'a '; > select count(*) from ax, stringa where t = `_c0`; > +--+ > | _c0 | > +--+ > | 0| > +--+ > {noformat} > [~jcamachorodriguez] [~kgyrtkirk] -- This message was sent by Atlassian Jira (v8.3.4#803005)
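The inconsistencies above come down to whether trailing spaces are stripped before comparing. A toy Java model of the two conventions being debated — SQL CHAR-style comparison that ignores trailing spaces versus exact STRING comparison — purely as an illustration of the semantics, not Hive's actual type-coercion code:

```java
// Toy model of CHAR vs STRING comparison semantics.
public class CharCompare {
    // Strip trailing spaces, as CHAR values conceptually lose their padding.
    static String rtrim(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == ' ') {
            end--;
        }
        return s.substring(0, end);
    }

    // CHAR vs CHAR: both sides lose trailing spaces, so 'a' = 'a   '.
    static boolean charEquals(String a, String b) {
        return rtrim(a).equals(rtrim(b));
    }

    // CHAR vs STRING, as the reporter expected: only the CHAR side is trimmed,
    // so a STRING with a trailing space never matches.
    static boolean charVsString(String charVal, String strVal) {
        return rtrim(charVal).equals(strVal);
    }

    public static void main(String[] args) {
        System.out.println(charEquals("a", "a         ")); // true
        System.out.println(charVsString("a", "a "));       // false
    }
}
```

Under this model, `t = cast('a ' as string)` returning 2 rows means Hive applied the `charEquals` convention rather than `charVsString` — which is exactly the inconsistency the reporter observes against the other STRING-valued expressions.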
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496654=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496654 ] ASF GitHub Bot logged work on HIVE-24203: - Author: ASF GitHub Bot Created on: 07/Oct/20 14:53 Start Date: 07/Oct/20 14:53 Worklog Time Spent: 10m Work Description: okumin commented on a change in pull request #1531: URL: https://github.com/apache/hive/pull/1531#discussion_r501078148 ## File path: ql/src/test/results/clientpositive/llap/annotate_stats_lateral_view_join.q.out ## @@ -503,14 +503,14 @@ STAGE PLANS: Statistics: Num rows: 1 Data size: 376 Basic stats: COMPLETE Column stats: COMPLETE Lateral View Join Operator outputColumnNames: _col0, _col1, _col5, _col6 - Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: NONE + Statistics: Num rows: 0 Data size: 24 Basic stats: PARTIAL Column stats: COMPLETE Review comment: This is an edge case since `HIVE_STATS_UDTF_FACTOR` is greater than or equal to 1. Anyway, I created a ticket. https://issues.apache.org/jira/browse/HIVE-24240 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496654) Time Spent: 1h 20m (was: 1h 10m) > Implement stats annotation rule for the LateralViewJoinOperator > --- > > Key: HIVE-24203 > URL: https://issues.apache.org/jira/browse/HIVE-24203 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer >Affects Versions: 4.0.0, 3.1.2, 2.3.7 >Reporter: okumin >Assignee: okumin >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > StatsRulesProcFactory doesn't have any rules to handle a JOIN by LATERAL VIEW. > This can cause an underestimation in case that UDTF in LATERAL VIEW generates > multiple rows. 
> HIVE-20262 has already added the rule for UDTF. > This issue would add the rule for LateralViewJoinOperator. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-24240) Implement missing features in UDTFStatsRule
[ https://issues.apache.org/jira/browse/HIVE-24240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] okumin reassigned HIVE-24240: - > Implement missing features in UDTFStatsRule > --- > > Key: HIVE-24240 > URL: https://issues.apache.org/jira/browse/HIVE-24240 > Project: Hive > Issue Type: Improvement >Affects Versions: 4.0.0 >Reporter: okumin >Assignee: okumin >Priority: Major > > Add the following steps. > * Handle the case in which the num row will be zero > * Compute runtime stats in case of a re-execution -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-23667) Incorrect output with option hive.auto.convert.join=fasle
[ https://issues.apache.org/jira/browse/HIVE-23667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209579#comment-17209579 ] Zoltan Haindrich commented on HIVE-23667: - could you please give a complete example to reproduce the issue? > Incorrect output with option hive.auto.convert.join=fasle > - > > Key: HIVE-23667 > URL: https://issues.apache.org/jira/browse/HIVE-23667 > Project: Hive > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: gaozhan ding >Priority: Critical > > We use Hive version 3.1.0 with Tez engine 0.9.1.3 > I encountered an error when executing a Hive SQL. The SQL is as follows > {code:java} > set mapreduce.job.queuename=root.xxx; > set hive.exec.dynamic.partition.mode=nonstrict; > set hive.exec.dynamic.partition=true; > set hive.exec.max.dynamic.partitions.pernode=1; > set hive.exec.max.dynamic.partitions=1; > set hive.fileformat.check=false; > set mapred.reduce.tasks=50; > set hive.auto.convert.join=true; > use xxx; > select count(*) from 230_dim_site join dw_fact_inverter_detail on > dw_fact_inverter_detail.site=230_dim_site.id;{code} > with the output: > {code:java} > +--+ | _c0 | +--+ | 4954736 | +--+ > {code} > But when the hive.auto.convert.join option is set to false, the output is not > as expected. > The SQL is as follows > {code:java} > set mapreduce.job.queuename=root.xxx; > set hive.exec.dynamic.partition.mode=nonstrict; > set hive.exec.dynamic.partition=true; > set hive.exec.max.dynamic.partitions.pernode=1; > set hive.exec.max.dynamic.partitions=1; > set hive.fileformat.check=false; > set mapred.reduce.tasks=50; > set hive.auto.convert.join=false; //changed > use xxx; > select count(*) from 230_dim_site join dw_fact_inverter_detail on > dw_fact_inverter_detail.site=230_dim_site.id;{code} > with output: > {code:java} > +--+ | _c0 | +--+ | 0 | +--+ > {code} > Besides, both tables participating in the join are partition tables.
> Especially,if the option mapred.reduce.tasks=50 was not set,all above the sql > output expected results. > We just upgraded hive from 1.2 to 3.1.0, and we found that these problems > only occurred in the old hive table. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=496615=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496615 ] ASF GitHub Bot logged work on HIVE-24203: - Author: ASF GitHub Bot Created on: 07/Oct/20 13:51 Start Date: 07/Oct/20 13:51 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1531: URL: https://github.com/apache/hive/pull/1531#discussion_r500976202 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java ## @@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, } } + /** + * LateralViewJoinOperator changes the data size and column level statistics. + * + * A diagram of LATERAL VIEW. + * + * [Lateral View Forward] + * / \ + *[Select] [Select] + *|| + *| [UDTF] + *\ / + * [Lateral View Join] + * + * For each row of the source, the left branch just picks columns and the right branch processes UDTF. + * And then LVJ joins a row from the left branch with rows from the right branch. + * The join has one-to-many relationship since UDTF can generate multiple rows. + * + * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides. + */ + public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor { +@Override +public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, + Object... 
nodeOutputs) throws SemanticException { + final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd; + final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx; + final HiveConf conf = aspCtx.getConf(); + + if (!isAllParentsContainStatistics(lop)) { +return null; + } + + final List> parents = lop.getParentOperators(); + if (parents.size() != 2) { +LOG.warn("LateralViewJoinOperator should have just two parents but actually has " ++ parents.size() + " parents."); +return null; + } + + final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics(); + final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics(); + + final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows(); Review comment: I know `selectStats.getNumRows()` should not be zero - but just in case... could you also add the resulting logic as `StatsUtils` or something like that? ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java ## @@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, } } + /** + * LateralViewJoinOperator changes the data size and column level statistics. + * + * A diagram of LATERAL VIEW. + * + * [Lateral View Forward] + * / \ + *[Select] [Select] + *|| + *| [UDTF] + *\ / + * [Lateral View Join] + * + * For each row of the source, the left branch just picks columns and the right branch processes UDTF. + * And then LVJ joins a row from the left branch with rows from the right branch. + * The join has one-to-many relationship since UDTF can generate multiple rows. + * + * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up the both sides. + */ + public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor { +@Override +public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx, + Object... 
nodeOutputs) throws SemanticException { + final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd; + final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx; + final HiveConf conf = aspCtx.getConf(); + + if (!isAllParentsContainStatistics(lop)) { +return null; + } + + final List> parents = lop.getParentOperators(); + if (parents.size() != 2) { +LOG.warn("LateralViewJoinOperator should have just two parents but actually has " ++ parents.size() + " parents."); +return null; + } + + final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics(); + final Statistics udtfStats = parents.get(LateralViewJoinOperator.UDTF_TAG).getStatistics(); + + final double factor = (double) udtfStats.getNumRows() / (double) selectStats.getNumRows(); + final long selectDataSize = StatsUtils.safeMult(selectStats.getDataSize(), factor); + final
[jira] [Work logged] (HIVE-24229) DirectSql fails in case of OracleDB
[ https://issues.apache.org/jira/browse/HIVE-24229?focusedWorklogId=496559=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496559 ] ASF GitHub Bot logged work on HIVE-24229: - Author: ASF GitHub Bot Created on: 07/Oct/20 12:47 Start Date: 07/Oct/20 12:47 Worklog Time Spent: 10m Work Description: ayushtkn commented on pull request #1552: URL: https://github.com/apache/hive/pull/1552#issuecomment-704911416 Yes, this gets surfaced in an internal test when run on Oracle DB. The table had a partition of type int, and I tried to access that using Spark, using an extension, something like `sql("select * from store_sales where ss_store_sk=10").show` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496559) Time Spent: 0.5h (was: 20m) > DirectSql fails in case of OracleDB > --- > > Key: HIVE-24229 > URL: https://issues.apache.org/jira/browse/HIVE-24229 > Project: Hive > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Direct SQL fails due to a different data type mapping in case of Oracle DB -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24229) DirectSql fails in case of OracleDB
[ https://issues.apache.org/jira/browse/HIVE-24229?focusedWorklogId=496542=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496542 ] ASF GitHub Bot logged work on HIVE-24229: - Author: ASF GitHub Bot Created on: 07/Oct/20 12:22 Start Date: 07/Oct/20 12:22 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #1552: URL: https://github.com/apache/hive/pull/1552#issuecomment-704898236 this "clob" stuff keeps coming back again and again... do you have a way to reproduce the issue? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496542) Time Spent: 20m (was: 10m) > DirectSql fails in case of OracleDB > --- > > Key: HIVE-24229 > URL: https://issues.apache.org/jira/browse/HIVE-24229 > Project: Hive > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Direct SQL fails due to a different data type mapping in case of Oracle DB -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23800) Add hooks when HiveServer2 stops due to OutOfMemoryError
[ https://issues.apache.org/jira/browse/HIVE-23800?focusedWorklogId=496538=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496538 ] ASF GitHub Bot logged work on HIVE-23800: - Author: ASF GitHub Bot Created on: 07/Oct/20 12:14 Start Date: 07/Oct/20 12:14 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #1205: URL: https://github.com/apache/hive/pull/1205#discussion_r500961287 ## File path: ql/src/java/org/apache/hadoop/hive/ql/HookRunner.java ## @@ -39,57 +36,27 @@ import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook; import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext; import org.apache.hadoop.hive.ql.session.SessionState; -import org.apache.hadoop.hive.ql.session.SessionState.LogHelper; import org.apache.hive.common.util.HiveStringUtils; +import static org.apache.hadoop.hive.ql.hooks.HookContext.HookType.*; + /** * Handles hook executions for {@link Driver}. */ public class HookRunner { private static final String CLASS_NAME = Driver.class.getName(); private final HiveConf conf; - private LogHelper console; - private List queryHooks = new ArrayList<>(); - private List saHooks = new ArrayList<>(); - private List driverRunHooks = new ArrayList<>(); - private List preExecHooks = new ArrayList<>(); - private List postExecHooks = new ArrayList<>(); - private List onFailureHooks = new ArrayList<>(); - private boolean initialized = false; + private final HooksLoader loader; Review comment: this is great! since from now on we can also dynamically add new hooks to it at runtime - we may rename it from "Loader" to something else. 
## File path: ql/src/java/org/apache/hadoop/hive/ql/hooks/HookContext.java ## @@ -45,7 +47,50 @@ public class HookContext { static public enum HookType { -PRE_EXEC_HOOK, POST_EXEC_HOOK, ON_FAILURE_HOOK + Review comment: I like this approach - could you make a small check: * if we have hook compiled for the old api (which uses say the enum key `HookType.PRE_EXEC_HOOK`) * will it work or not (without recompilation) with the new implementation This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496538) Time Spent: 5h 10m (was: 5h) > Add hooks when HiveServer2 stops due to OutOfMemoryError > > > Key: HIVE-23800 > URL: https://issues.apache.org/jira/browse/HIVE-23800 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Reporter: Zhihua Deng >Priority: Minor > Labels: pull-request-available > Time Spent: 5h 10m > Remaining Estimate: 0h > > Make oom hook an interface of HiveServer2, so user can implement the hook to > do something before HS2 stops, such as dumping the heap or altering the > devops. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
[ https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=496526&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496526 ] ASF GitHub Bot logged work on HIVE-23851: - Author: ASF GitHub Bot Created on: 07/Oct/20 12:03 Start Date: 07/Oct/20 12:03 Worklog Time Spent: 10m Work Description: kgyrtkirk edited a comment on pull request #1271: URL: https://github.com/apache/hive/pull/1271#issuecomment-704887635 first of all: sorry for being very slow to respond - there were a bunch of things (renovation things :D) ...things look better now, so I'll be more likely to respond in a reasonable timeframe :) I now wonder what's the benefit of this kryo stuff... I think there is no client in the world which could really use that correctly - I think we even bind our metastore/hive versions together, since it uses some internal ql classes inside the kryo byte array. What do you think about the following - would it be possible: * remove (or at least deprecate) the `byte[]` kryo stuff from the thrift api * replace it with the string based approach... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496526) Time Spent: 4h 50m (was: 4h 40m) > MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions > > > Key: HIVE-23851 > URL: https://issues.apache.org/jira/browse/HIVE-23851 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 4h 50m > Remaining Estimate: 0h > > *Steps to reproduce:* > # Create external table > # Run msck command to sync all the partitions with metastore > # Remove one of the partition path > # Run msck repair with partition filtering > *Stack Trace:* > {code:java} > 2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] > ppr.PartitionExpressionForMetastore: Failed to deserialize the expression > java.lang.IndexOutOfBoundsException: Index: 110, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192] > at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192] > at > org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > 
org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96) > [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52) > [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48) > [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593) > [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80) > [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT] > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192] > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:1.8.0_192] > {code} > *Cause:* > In case of msck repair with partition filtering we expect expression proxy
[jira] [Work logged] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
[ https://issues.apache.org/jira/browse/HIVE-23851?focusedWorklogId=496523&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496523 ] ASF GitHub Bot logged work on HIVE-23851: - Author: ASF GitHub Bot Created on: 07/Oct/20 12:01 Start Date: 07/Oct/20 12:01 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on pull request #1271: URL: https://github.com/apache/hive/pull/1271#issuecomment-704887635 I now wonder what's the benefit of this kryo stuff... I think there is no client in the world which could really use that correctly - I think we even bind our metastore/hive versions together, since it uses some internal ql classes inside the kryo byte array. What do you think about the following - would it be possible: * remove (or at least deprecate) the `byte[]` kryo stuff from the thrift api * replace it with the string based approach... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496523) Time Spent: 4h 40m (was: 4.5h) > MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions > > > Key: HIVE-23851 > URL: https://issues.apache.org/jira/browse/HIVE-23851 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Syed Shameerur Rahman >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > *Steps to reproduce:* > # Create external table > # Run msck command to sync all the partitions with metastore > # Remove one of the partition path > # Run msck repair with partition filtering > *Stack Trace:* > {code:java} > 2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] > ppr.PartitionExpressionForMetastore: Failed to deserialize the expression > java.lang.IndexOutOfBoundsException: Index: 110, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192] > at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192] > at > org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > 
org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775) > ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96) > [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52) > [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48) > [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593) > [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT] > at > org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80) > [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT] > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192] > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:1.8.0_192] > {code} > *Cause:* > In case of msck repair with partition filtering we expect expression proxy > class to be set as PartitionExpressionForMetastore ( > https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78 > ), While dropping partition we
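To make the suggestion above concrete: the string-based approach would ship the partition predicate as a plain filter string (the form `listPartitionsByFilter` already accepts) instead of a kryo-serialized `byte[]` of internal ql classes, so metastore and client versions would no longer need to match byte-for-byte. A minimal sketch with hypothetical helper names, not an actual Hive API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class PartitionFilterSketch {
    // Render a literal the way metastore-style filter strings expect:
    // numbers unquoted, everything else single-quoted.
    static String literal(Object value) {
        return value instanceof Number ? value.toString() : "'" + value + "'";
    }

    static String comparison(String column, String op, Object value) {
        return column + " " + op + " " + literal(value);
    }

    static String and(List<String> predicates) {
        StringJoiner joiner = new StringJoiner(" AND ");
        for (String p : predicates) {
            joiner.add(p);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // A predicate like (ds = '2020-10-07' AND hr > 10) travels as text,
        // so the server never needs Hive's ql classes to decode it.
        String filter = and(Arrays.asList(
                comparison("ds", "=", "2020-10-07"),
                comparison("hr", ">", 10)));
        System.out.println(filter); // ds = '2020-10-07' AND hr > 10
    }
}
```

The trade-off is expressiveness: a filter grammar covers fewer predicates than an arbitrary serialized `ExprNodeDesc` tree, which is presumably why the kryo path existed in the first place.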
[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection
[ https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496489 ] ASF GitHub Bot logged work on HIVE-24225: - Author: ASF GitHub Bot Created on: 07/Oct/20 10:54 Start Date: 07/Oct/20 10:54 Worklog Time Spent: 10m Work Description: pgaref edited a comment on pull request #1547: URL: https://github.com/apache/hive/pull/1547#issuecomment-704856290 Hey @steveloughran --- the approach of the above patch was a bit off: one problem was that the FS objects were lazily initialized and could end up throwing exceptions when setting the option eagerly. The most important issue was that LLAP IO creates its own FS object (and the above were only used for output), so the option itself was not properly propagated. A solution for all this could be the S3A **openFileWithOptions** call that adds file options on the open-file call instead of on the FS (it still needs to add support for **fadvise** though) https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L4828 Talking about this TODO: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L1136 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496489) Time Spent: 1h (was: 50m) > FIX S3A recordReader policy selection > - > > Key: HIVE-24225 > URL: https://issues.apache.org/jira/browse/HIVE-24225 > Project: Hive > Issue Type: Bug >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Dynamic S3A recordReader policy selection can cause issues on lazy > initialized FS objects -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection
[ https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496487&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496487 ] ASF GitHub Bot logged work on HIVE-24225: - Author: ASF GitHub Bot Created on: 07/Oct/20 10:53 Start Date: 07/Oct/20 10:53 Worklog Time Spent: 10m Work Description: pgaref edited a comment on pull request #1547: URL: https://github.com/apache/hive/pull/1547#issuecomment-704856290 Hey @steveloughran --- the approach of the above patch was a bit off: one problem was that the FS objects were lazily initialized and could end up throwing exceptions when setting the option eagerly. The most important issue was that LLAP IO creates its own FS object (and the above were only used for output), so the option itself was not properly propagated. A solution for all this could be the S3A **openFileWithOptions** call that adds file options on the open-file call instead of on the FS (it still needs to add support for **fadvise** though) https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L4828 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496487) Time Spent: 50m (was: 40m) > FIX S3A recordReader policy selection > - > > Key: HIVE-24225 > URL: https://issues.apache.org/jira/browse/HIVE-24225 > Project: Hive > Issue Type: Bug >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Dynamic S3A recordReader policy selection can cause issues on lazily > initialized FS objects -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection
[ https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496486&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496486 ] ASF GitHub Bot logged work on HIVE-24225: - Author: ASF GitHub Bot Created on: 07/Oct/20 10:52 Start Date: 07/Oct/20 10:52 Worklog Time Spent: 10m Work Description: pgaref commented on pull request #1547: URL: https://github.com/apache/hive/pull/1547#issuecomment-704856290 Hey @steveloughran --- the approach of the above patch was a bit off: one problem was that the FS objects were lazily initialized and could end up throwing exceptions when setting the option eagerly. The most important issue was that LLAP IO creates its own FS object (and the above were only used for output). A solution for all this could be the S3A **openFileWithOptions** call that adds file options on the open-file call instead of on the FS (it still needs to add support for **fadvise** though) https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L4828 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496486) Time Spent: 40m (was: 0.5h) > FIX S3A recordReader policy selection > - > > Key: HIVE-24225 > URL: https://issues.apache.org/jira/browse/HIVE-24225 > Project: Hive > Issue Type: Bug >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Dynamic S3A recordReader policy selection can cause issues on lazily > initialized FS objects -- This message was sent by Atlassian Jira (v8.3.4#803005)
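For reference, the read policy discussed above corresponds to Hadoop's `fs.s3a.experimental.input.fadvise` S3A option (`normal`, `sequential`, or `random`); until per-stream options are wired through, it can be pinned cluster-wide in configuration. An illustrative `core-site.xml` fragment, not part of this patch:

```xml
<!-- Illustrative only: pins the S3A input policy for every stream opened by
     this filesystem, sidestepping the per-FS propagation issue above. -->
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <!-- normal | sequential | random; "random" suits ORC/Parquet range reads -->
  <value>random</value>
</property>
```

This is a static fallback; the `openFileWithOptions`/`openFile()` builder route discussed above would let each reader pick its own policy instead.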
[jira] [Work logged] (HIVE-24199) Incorrect result when subquery in exists contains limit
[ https://issues.apache.org/jira/browse/HIVE-24199?focusedWorklogId=496476&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496476 ] ASF GitHub Bot logged work on HIVE-24199: - Author: ASF GitHub Bot Created on: 07/Oct/20 10:39 Start Date: 07/Oct/20 10:39 Worklog Time Spent: 10m Work Description: kasakrisz commented on a change in pull request #1525: URL: https://github.com/apache/hive/pull/1525#discussion_r500910826 ## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSubQueryRemoveRule.java ## @@ -406,6 +409,16 @@ private RexNode rewriteInExists(RexSubQuery e, Set variablesSet, offset = offset + 1; builder.push(e.rel); } +} else if (e.getKind() == SqlKind.EXISTS && !variablesSet.isEmpty()) { + // Query has 'exists' and correlation: Review comment: Added comment This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496476) Time Spent: 40m (was: 0.5h) > Incorrect result when subquery in exists contains limit > -- > > Key: HIVE-24199 > URL: https://issues.apache.org/jira/browse/HIVE-24199 > Project: Hive > Issue Type: Bug >Reporter: Krisztian Kasa >Assignee: Krisztian Kasa >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > {code:java} > create table web_sales (ws_order_number int, ws_warehouse_sk int) stored as > orc; > insert into web_sales values > (1, 1), > (1, 2), > (2, 1), > (2, 2); > select * from web_sales ws1 > where exists (select 1 from web_sales ws2 where ws1.ws_order_number = > ws2.ws_order_number limit 1); > 1 1 > 1 2 > {code} > {code:java} > CBO PLAN: > HiveSemiJoin(condition=[=($0, $2)], joinType=[semi]) > HiveProject(ws_order_number=[$0], ws_warehouse_sk=[$1]) > HiveFilter(condition=[IS NOT NULL($0)]) > 
HiveTableScan(table=[[default, web_sales]], table:alias=[ws1]) > HiveProject(ws_order_number=[$0]) > HiveSortLimit(fetch=[1]) <-- This shouldn't be added > HiveProject(ws_order_number=[$0]) > HiveFilter(condition=[IS NOT NULL($0)]) > HiveTableScan(table=[[default, web_sales]], table:alias=[ws2]) > {code} > Limit n on the right side of the join reduces the result set coming from the > right to only n records, hence not all the ws_order_number values are included, > which leads to a correctness issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
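The plan defect can be simulated outside Hive. A minimal sketch (plain Java, not Hive code) of a left-semijoin over the reproducer's data, with and without the erroneous limit on the EXISTS side:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExistsLimitDemo {
    // One web_sales row: {ws_order_number, ws_warehouse_sk}.
    static final int[][] WEB_SALES = { {1, 1}, {1, 2}, {2, 1}, {2, 2} };

    // Left-semijoin: keep a left row iff its order number appears on the right.
    static List<int[]> semiJoin(int[][] left, Set<Integer> rightOrderNumbers) {
        List<int[]> out = new ArrayList<>();
        for (int[] row : left) {
            if (rightOrderNumbers.contains(row[0])) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Correct plan: the EXISTS side keeps every distinct ws_order_number.
        Set<Integer> allKeys = new HashSet<>(Arrays.asList(1, 2));
        System.out.println(semiJoin(WEB_SALES, allKeys).size()); // 4 - all rows, as EXISTS requires

        // Buggy plan: HiveSortLimit(fetch=[1]) below the join keeps only one
        // right row, so one key survives and matching outer rows are dropped.
        Set<Integer> limitedKeys = new HashSet<>(Arrays.asList(1));
        System.out.println(semiJoin(WEB_SALES, limitedKeys).size()); // 2 - rows with order number 2 lost
    }
}
```

This reproduces the reported wrong output (only the two `ws_order_number = 1` rows) and shows why a LIMIT inside a correlated EXISTS can simply be dropped: one matching row is as good as n for the predicate.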
[jira] [Work logged] (HIVE-24225) FIX S3A recordReader policy selection
[ https://issues.apache.org/jira/browse/HIVE-24225?focusedWorklogId=496472=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496472 ] ASF GitHub Bot logged work on HIVE-24225: - Author: ASF GitHub Bot Created on: 07/Oct/20 10:33 Start Date: 07/Oct/20 10:33 Worklog Time Spent: 10m Work Description: steveloughran commented on pull request #1547: URL: https://github.com/apache/hive/pull/1547#issuecomment-704847820 why the revert? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496472) Time Spent: 0.5h (was: 20m) > FIX S3A recordReader policy selection > - > > Key: HIVE-24225 > URL: https://issues.apache.org/jira/browse/HIVE-24225 > Project: Hive > Issue Type: Bug >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Dynamic S3A recordReader policy selection can cause issues on lazy > initialized FS objects -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24234) Improve checkHashModeEfficiency in VectorGroupByOperator
[ https://issues.apache.org/jira/browse/HIVE-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209327#comment-17209327 ] Rajesh Balamohan commented on HIVE-24234: - Thanks [~mustafaiman]. >> (outputRecords) / (inputRecords * 1.0f) can be larger than 1 when grouping >> sets are present. No, it is the other way around. {{sumBatchSize}} already includes the computation needed for grouping sets. So in the worst possible case, the max ratio would be "1.0". Since "1.0 > 1.0" would be false, the config still holds good. (i.e. setting 1.0 would never move to streaming mode.) https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java#L206 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorGroupByOperator.java#L494 The basic idea is to ensure that hashing with grouping sets is highly effective (otherwise we end up paying the penalty of JVM memory pressure); otherwise, it needs to bail out quickly and move to streaming mode. > Improve checkHashModeEfficiency in VectorGroupByOperator > > > Key: HIVE-24234 > URL: https://issues.apache.org/jira/browse/HIVE-24234 > Project: Hive > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Major > Labels: pull-request-available > Attachments: HIVE-24234.wip.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the > number of entries with the number of input records that have been processed. For > grouping sets, it accounts for grouping set length as well. > The issue is that the condition becomes invalid after processing a large number of > input records. This prevents the system from switching over to streaming > mode. > e.g. Assume 500,000 input records processed, with 9 grouping sets, with > 100,000 entries in hashtable. Hashtable would never cross 4,500,000 entries > as the max size itself is 1M by default. 
> It would be good to compare the input records (adjusted for grouping sets) > with the number of output records (along with the size of the hashtable) to > determine hashing or streaming mode. > E.g. Q67. -- This message was sent by Atlassian Jira (v8.3.4#803005)
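The check described in the comment above boils down to a ratio of output records to (grouping-set-adjusted) input records. A rough sketch with illustrative names (not the actual `VectorGroupByOperator` fields):

```java
public class HashModeCheck {
    // With grouping sets, each input row is aggregated once per grouping set,
    // so inputRecords here is the adjusted count (what the comment calls
    // sumBatchSize), not the raw row count.
    static boolean shouldSwitchToStreaming(long outputRecords, long inputRecords, float threshold) {
        if (inputRecords == 0) {
            return false; // nothing processed yet
        }
        float ratio = outputRecords / (inputRecords * 1.0f);
        // The ratio maxes out at 1.0, so a threshold of 1.0 never triggers
        // streaming mode; ratios near 1.0 mean hashing barely reduces the data.
        return ratio > threshold;
    }

    public static void main(String[] args) {
        // 500,000 raw rows * 9 grouping sets = 4,500,000 adjusted input rows,
        // but only 100,000 hash entries: hashing is effective, stay in hash mode.
        System.out.println(shouldSwitchToStreaming(100_000, 4_500_000, 0.5f)); // false
        // Nearly one output row per input row: bail out to streaming mode.
        System.out.println(shouldSwitchToStreaming(4_400_000, 4_500_000, 0.5f)); // true
    }
}
```

The real operator also factors in the hash table's memory footprint; this only illustrates the ratio argument made in the comment.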
[jira] [Updated] (HIVE-24234) Improve checkHashModeEfficiency in VectorGroupByOperator
[ https://issues.apache.org/jira/browse/HIVE-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24234: -- Labels: pull-request-available (was: ) > Improve checkHashModeEfficiency in VectorGroupByOperator > > > Key: HIVE-24234 > URL: https://issues.apache.org/jira/browse/HIVE-24234 > Project: Hive > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Major > Labels: pull-request-available > Attachments: HIVE-24234.wip.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the > number of entries with the number of input records that have been processed. For > grouping sets, it accounts for grouping set length as well. > The issue is that the condition becomes invalid after processing a large number of > input records. This prevents the system from switching over to streaming > mode. > e.g. Assume 500,000 input records processed, with 9 grouping sets, with > 100,000 entries in hashtable. Hashtable would never cross 4,500,000 entries > as the max size itself is 1M by default. > It would be good to compare the input records (adjusted for grouping sets) > with the number of output records (along with the size of the hashtable) to > determine hashing or streaming mode. > E.g. Q67. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24234) Improve checkHashModeEfficiency in VectorGroupByOperator
[ https://issues.apache.org/jira/browse/HIVE-24234?focusedWorklogId=496334&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496334 ] ASF GitHub Bot logged work on HIVE-24234: - Author: ASF GitHub Bot Created on: 07/Oct/20 06:39 Start Date: 07/Oct/20 06:39 Worklog Time Spent: 10m Work Description: rbalamohan opened a new pull request #1560: URL: https://github.com/apache/hive/pull/1560 https://issues.apache.org/jira/browse/HIVE-24234 Queries with grouping sets process input records multiple times, which significantly increases the number of hash aggregation lookup operations. When aggregation does not yield a significant reduction, it becomes memory intensive and adds to JVM memory pressure. Earlier, due to a minor bug, it wasn't switching over to streaming mode. This has been fixed in the current patch, which also takes care of the situation when grouping sets are not very effective in reduction. Tried out Q67 of TPCDS on an internal cluster, which shows significant improvement with this. For standalone tests, TestVectorGroupByOperator covers this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 496334) Remaining Estimate: 0h Time Spent: 10m > Improve checkHashModeEfficiency in VectorGroupByOperator > > > Key: HIVE-24234 > URL: https://issues.apache.org/jira/browse/HIVE-24234 > Project: Hive > Issue Type: Improvement >Reporter: Rajesh Balamohan >Priority: Major > Attachments: HIVE-24234.wip.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, {{VectorGroupByOperator::checkHashModeEfficiency}} compares the > number of entries with the number of input records that have been processed. For > grouping sets, it accounts for grouping set length as well. 
> The issue is that the condition becomes invalid after processing a large number of > input records. This prevents the system from switching over to streaming > mode. > e.g. Assume 500,000 input records processed, with 9 grouping sets, with > 100,000 entries in hashtable. Hashtable would never cross 4,500,000 entries > as the max size itself is 1M by default. > It would be good to compare the input records (adjusted for grouping sets) > with the number of output records (along with the size of the hashtable) to > determine hashing or streaming mode. > E.g. Q67. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24238) ClassCastException in vectorized order-by query over avro table with uniontype column
[ https://issues.apache.org/jira/browse/HIVE-24238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabriel C Balan updated HIVE-24238: --- Component/s: Vectorization > ClassCastException in vectorized order-by query over avro table with > uniontype column > - > > Key: HIVE-24238 > URL: https://issues.apache.org/jira/browse/HIVE-24238 > Project: Hive > Issue Type: Bug > Components: Avro, Vectorization >Affects Versions: 3.1.0, 3.1.2 >Reporter: Gabriel C Balan >Priority: Minor > > {noformat:title=Reproducer} > create table avro_reproducer (key int, union_col uniontype<int,string>) > stored as avro location '/tmp/avro_reproducer'; > INSERT INTO TABLE avro_reproducer values (0, create_union(0, 123, 'not me')), > (1, create_union(1, -1, 'me, me, me!')); > --these queries are ok: > select count(*) from avro_reproducer; > select * from avro_reproducer; > --these queries are not ok > select * from avro_reproducer order by union_col; > select * from avro_reproducer sort by key; > select * from avro_reproducer order by 'does not have to be a column, > really'; > {noformat} > I have verified this reproducer on CDH703, HDP301. > It seems the issue is restricted to AVRO; this reproducer does not trigger > failures against textfile tables, orc tables, and parquet tables. 
> Also, the issue is restricted to vectorized execution; it goes away if I set hive.vectorized.execution.enabled=false
> {noformat:title=Error message in CLI}
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
>         at scala.Option.foreach(Option.scala:257)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> Caused by: java.lang.RuntimeException: Error processing row: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
>         at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:155)
>         at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>         at org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>         at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>         at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>         at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>         at org.apache.spark.scheduler.Task.run(Task.scala:123)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1315)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
>         at org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:970)
>         at
[jira] [Work logged] (HIVE-24082) Expose information whether AcidUtils.ParsedDelta contains statementId
[ https://issues.apache.org/jira/browse/HIVE-24082?focusedWorklogId=496321&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-496321 ]

ASF GitHub Bot logged work on HIVE-24082:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 07/Oct/20 06:10
Start Date: 07/Oct/20 06:10
Worklog Time Spent: 10m

Work Description: harmandeeps commented on a change in pull request #1438:
URL: https://github.com/apache/hive/pull/1438#discussion_r500759183

##
File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java
##
@@ -1031,8 +1031,12 @@ public Path getPath() {
       return path;
     }

+    public boolean hasStatementId() {

Review comment:
       Yeah, we may need this information outside Hive to figure out whether statementId is present for the delta.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 496321)
Time Spent: 2h 20m (was: 2h 10m)

> Expose information whether AcidUtils.ParsedDelta contains statementId
> ---------------------------------------------------------------------
>
>                 Key: HIVE-24082
>                 URL: https://issues.apache.org/jira/browse/HIVE-24082
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Piotr Findeisen
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> In [Presto|https://prestosql.io] we support reading ORC ACID tables by leveraging AcidUtils rather than duplicating the file-name parsing logic in our code.
> To do this fully correctly, we need to know whether {{org.apache.hadoop.hive.ql.io.AcidUtils.ParsedDelta}} contains {{statementId}} information or not.
> Currently, the getter of that property does not give us access to this information.
> [https://github.com/apache/hive/blob/468907eab36f78df3e14a24005153c9a23d62555/ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java#L804-L806]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
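The accessor discussed in the worklog above can be illustrated with a standalone sketch. ParsedDelta here is a simplified, hypothetical stand-in for the inner class of AcidUtils (the real class also carries min/max write IDs, the delta path, and more), and the zero-defaulting behavior of getStatementId() is an assumption inferred from the linked lines; only the hasStatementId() signature comes from the PR diff.

```java
// Simplified, hypothetical stand-in for org.apache.hadoop.hive.ql.io.AcidUtils.ParsedDelta.
final class ParsedDelta {
    // -1 models a delta directory name without a statement suffix,
    // e.g. delta_0000005_0000005 vs. delta_0000005_0000005_0003.
    private final int statementId;

    ParsedDelta(int statementId) {
        this.statementId = statementId;
    }

    // Existing getter (behavior assumed from the linked lines): defaults to 0
    // when no statementId was parsed, so callers cannot distinguish
    // "no statementId" from "statementId == 0".
    public int getStatementId() {
        return statementId >= 0 ? statementId : 0;
    }

    // Accessor added by the PR: exposes presence explicitly, so a consumer
    // such as Presto need not re-parse the delta directory name.
    public boolean hasStatementId() {
        return statementId >= 0;
    }
}

public class ParsedDeltaSketch {
    public static void main(String[] args) {
        ParsedDelta noStmt = new ParsedDelta(-1);
        ParsedDelta stmt3 = new ParsedDelta(3);
        System.out.println(noStmt.hasStatementId() + " " + noStmt.getStatementId()); // false 0
        System.out.println(stmt3.hasStatementId() + " " + stmt3.getStatementId());   // true 3
    }
}
```

This shows why the getter alone is ambiguous: both a suffix-less delta and one with statement ID 0 report getStatementId() == 0, and only the new boolean accessor tells them apart.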