[jira] [Assigned] (HIVE-24233) except subquery throws nullpointer with cbo disabled
[ https://issues.apache.org/jira/browse/HIVE-24233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Varga reassigned HIVE-24233:

> except subquery throws nullpointer with cbo disabled
>
> Key: HIVE-24233
> URL: https://issues.apache.org/jira/browse/HIVE-24233
> Project: Hive
> Issue Type: Bug
> Reporter: Peter Varga
> Assignee: Peter Varga
> Priority: Major
>
> Except and intersect were only implemented with Calcite in HIVE-12764. If CBO
> is disabled, they just throw a NullPointerException. We should at least
> throw a SemanticException stating that this is not supported.
> Repro:
> {code:java}
> set hive.cbo.enable=false;
> create table test(id int);
> insert into table test values(1);
> select id from test except select id from test;
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
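The fix the report asks for — fail fast with a SemanticException instead of an NPE — can be sketched as a plain-Java guard. This is an illustrative toy, not Hive's actual code: `SetOpGuard`, `checkSetOperationSupported`, and `rejects` are hypothetical stand-ins defined locally, and the local `SemanticException` only mimics Hive's class of the same name.

```java
// Illustrative sketch of the proposed behavior: when CBO is off, reject
// EXCEPT/INTERSECT with a clear SemanticException instead of an NPE.
// All names here are local stand-ins, not Hive's real API.
public class SetOpGuard {

    // Local stand-in for org.apache.hadoop.hive.ql.parse.SemanticException.
    static class SemanticException extends Exception {
        SemanticException(String msg) { super(msg); }
    }

    // Hypothetical guard a semantic analyzer could run before planning.
    static void checkSetOperationSupported(String operator, boolean cboEnabled)
            throws SemanticException {
        boolean needsCbo = "EXCEPT".equals(operator) || "INTERSECT".equals(operator);
        if (needsCbo && !cboEnabled) {
            throw new SemanticException(operator
                + " is only supported with CBO enabled (set hive.cbo.enable=true)");
        }
    }

    // Convenience wrapper so callers can probe without a try/catch.
    static boolean rejects(String operator, boolean cboEnabled) {
        try {
            checkSetOperationSupported(operator, cboEnabled);
            return false;
        } catch (SemanticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("EXCEPT", false));  // the repro above: rejected
        System.out.println(rejects("EXCEPT", true));   // CBO on: allowed
        System.out.println(rejects("UNION", false));   // UNION never needed CBO
    }
}
```

The point of the sketch is only the error-reporting contract: the query is rejected at analysis time with an actionable message rather than failing deep inside planning.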
[jira] [Issue Comment Deleted] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
[ https://issues.apache.org/jira/browse/HIVE-23851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-23851:

Comment: was deleted (was: [~kgyrtkirk] As per your comments, I have changed the implementation. Please review the PR. Thanks.)

> MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
>
> Key: HIVE-23851
> URL: https://issues.apache.org/jira/browse/HIVE-23851
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.0.0
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
> Time Spent: 4.5h
> Remaining Estimate: 0h
>
> *Steps to reproduce:*
> # Create an external table
> # Run the msck command to sync all the partitions with the metastore
> # Remove one of the partition paths
> # Run msck repair with partition filtering
>
> *Stack Trace:*
> {code:java}
> 2020-07-15T02:10:29,045 ERROR [4dad298b-28b1-4e6b-94b6-aa785b60c576 main] ppr.PartitionExpressionForMetastore: Failed to deserialize the expression
> java.lang.IndexOutOfBoundsException: Index: 110, Size: 0
> at java.util.ArrayList.rangeCheck(ArrayList.java:657) ~[?:1.8.0_192]
> at java.util.ArrayList.get(ArrayList.java:433) ~[?:1.8.0_192]
> at org.apache.hive.com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hive.com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:857) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:707) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.exec.SerializationUtilities$KryoWithHooks.readObject(SerializationUtilities.java:211) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeObjectFromKryo(SerializationUtilities.java:806) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.exec.SerializationUtilities.deserializeExpressionFromKryo(SerializationUtilities.java:775) ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.deserializeExpr(PartitionExpressionForMetastore.java:96) [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.optimizer.ppr.PartitionExpressionForMetastore.convertExprToFilter(PartitionExpressionForMetastore.java:52) [hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.metastore.PartFilterExprUtil.makeExpressionTree(PartFilterExprUtil.java:48) [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByExprInternal(ObjectStore.java:3593) [hive-standalone-metastore-server-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
> at org.apache.hadoop.hive.metastore.VerifyingObjectStore.getPartitionsByExpr(VerifyingObjectStore.java:80) [hive-standalone-metastore-server-4.0.0-SNAPSHOT-tests.jar:4.0.0-SNAPSHOT]
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_192]
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_192]
> {code}
>
> *Cause:*
> In the case of msck repair with partition filtering, we expect the expression proxy
> class to be set to PartitionExpressionForMetastore
> ( https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckAnalyzer.java#L78 ).
> While dropping a partition, we serialize the drop-partition filter expression
> ( https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L589 )
> in a format that is incompatible with the deserialization happening in
> PartitionExpressionForMetastore
> ( https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ppr/PartitionExpressionForMetastore.java#L52 ),
> hence the query fails with "Failed to deserialize the expression".
>
> *Solutions:*
> I could think of two approaches to this problem:
> # Since PartitionExpressionForMetastore is required only during the partition
> pruning step, we can switch the expression proxy class back to
> MsckPartitionExpressionProxy once the partition pruning step is done.
> # The other solution is to make the serialization of the msck drop-partition
> filter expression compatible with the one in PartitionExpressionForMetastore.
> We can do this via reflection, since the drop-partition serialization happens
> in the Msck class (standalone-metastore); this way we can completely remove
> the need for the MsckPartitionExpressionProxy class.
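The incompatible-serialization failure described above can be modeled in a few lines of plain Java. This is a toy, not Hive or Kryo code: `TextProxy` and `FramedProxy` are hypothetical stand-ins for the two expression-proxy classes, each using an invented wire format. Deserializing one format with the other proxy's reader fails with an IndexOutOfBoundsException, loosely analogous to the `Index: 110, Size: 0` in the stack trace.

```java
import java.nio.charset.StandardCharsets;

// Toy model of the bug's shape: two proxies that agree on an interface but
// not on a wire format. Serializing with one and deserializing with the
// other fails. The formats themselves are invented for this sketch.
public class ProxyMismatch {

    interface ExpressionProxy {
        byte[] serialize(String expr);
        String deserialize(byte[] bytes);
    }

    // Stand-in for one proxy: writes the expression as plain UTF-8 text.
    static class TextProxy implements ExpressionProxy {
        public byte[] serialize(String expr) {
            return expr.getBytes(StandardCharsets.UTF_8);
        }
        public String deserialize(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Stand-in for the other proxy: expects a one-byte length prefix.
    static class FramedProxy implements ExpressionProxy {
        public byte[] serialize(String expr) {
            byte[] body = expr.getBytes(StandardCharsets.UTF_8);
            byte[] out = new byte[body.length + 1];
            out[0] = (byte) body.length;
            System.arraycopy(body, 0, out, 1, body.length);
            return out;
        }
        public String deserialize(byte[] bytes) {
            int len = bytes[0]; // misreads a payload byte as a length on foreign input
            if (len < 0 || len > bytes.length - 1) {
                throw new IndexOutOfBoundsException(
                    "Index: " + len + ", Size: " + (bytes.length - 1));
            }
            return new String(bytes, 1, len, StandardCharsets.UTF_8);
        }
    }

    static String roundTrip(ExpressionProxy writer, ExpressionProxy reader, String expr) {
        return reader.deserialize(writer.serialize(expr));
    }

    public static void main(String[] args) {
        ExpressionProxy text = new TextProxy();
        ExpressionProxy framed = new FramedProxy();
        // Matching formats: the expression survives the round trip.
        System.out.println(roundTrip(framed, framed, "datekey > 20200101"));
        // Mismatched formats: the reader misinterprets the bytes and throws.
        try {
            roundTrip(text, framed, "datekey > 20200101");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Failed to deserialize the expression: " + e.getMessage());
        }
    }
}
```

Both proposed solutions amount to making writer and reader agree: either swap the reader back (solution 1) or make the writer emit the reader's format (solution 2).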
[jira] [Issue Comment Deleted] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
[ https://issues.apache.org/jira/browse/HIVE-23851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-23851:

Comment: was deleted (was: [~kgyrtkirk] Does the new approach makes sense?)
[jira] [Commented] (HIVE-23851) MSCK REPAIR Command With Partition Filtering Fails While Dropping Partitions
[ https://issues.apache.org/jira/browse/HIVE-23851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208490#comment-17208490 ] Syed Shameerur Rahman commented on HIVE-23851:

[~kgyrtkirk] Could you please review the PR?
[jira] [Comment Edited] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208486#comment-17208486 ] Syed Shameerur Rahman edited comment on HIVE-18284 at 10/6/20, 5:15 AM:

[~kgyrtkirk] [~jcamachorodriguez] [~ashutoshc] ping for review request!

was (Author: srahman): [~kgyrtkirk] [~jcamachorodriguez] ping for review request!

> NPE when inserting data with 'distribute by' clause with dynpart sort optimization
>
> Key: HIVE-18284
> URL: https://issues.apache.org/jira/browse/HIVE-18284
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 2.3.1, 2.3.2
> Reporter: Aki Tanaka
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> A NullPointerException occurs when inserting data with a 'distribute by' clause.
> The following snippet reproduces this issue
> *(non-vectorized, non-llap mode)*:
> {code:java}
> create table table1 (col1 string, datekey int);
> insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1);
> create table table2 (col1 string) partitioned by (datekey int);
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.sort.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> insert into table table2
> PARTITION(datekey)
> select col1,
> datekey
> from table1
> distribute by datekey;
> {code}
> I could run the insert query without the error if I removed Distribute By or
> used a Cluster By clause.
> It seems that the issue happens because Distribute By does not guarantee
> clustering or sorting properties on the distributed keys.
> FileSinkOperator removes the previous fsp, which might be re-used when we use
> Distribute By:
> https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972
> The following stack trace is logged:
> {code:java}
> Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}}
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365)
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250)
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
> ... 14 more
> Caused by: java.lang.NullPointerException
> at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762)
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
> at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
> at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356)
> ... 17 more
> {code}
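The fsp re-use described above can be illustrated with a toy model of the sort-dynamic-partition sink. This is not Hive code: `DynPartSinkModel` and its fields are invented for the sketch, which only assumes (as the report says) that the optimization expects rows grouped by partition key and frees the previous partition's state on a key change. DISTRIBUTE BY alone does not sort within the reducer, so a key can come back after its writer was dropped and the lookup dereferences null, matching the NPE on row ('ROW3', 1).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (not Hive code) of a file sink under the dynamic-partition
// sort optimization: it assumes keys arrive grouped, so on a key change it
// drops the previous partition's writer and never re-creates one it has
// already seen. Unsorted keys then hit a null writer.
public class DynPartSinkModel {
    private final Map<Integer, StringBuilder> writers = new HashMap<>();
    private final Set<Integer> created = new HashSet<>();
    private Integer currentKey = null;

    public void process(int partitionKey, String row) {
        if (currentKey != null && currentKey != partitionKey) {
            // Sorted-input assumption: the previous partition is finished,
            // so close and drop its writer.
            writers.remove(currentKey);
        }
        currentKey = partitionKey;
        if (created.add(partitionKey)) {
            writers.put(partitionKey, new StringBuilder());
        }
        // NPE here when a key reappears after its writer was dropped.
        writers.get(partitionKey).append(row).append('\n');
    }

    public static void main(String[] args) {
        DynPartSinkModel sorted = new DynPartSinkModel();
        sorted.process(1, "ROW1");   // CLUSTER BY / sorted input: keys grouped
        sorted.process(1, "ROW3");
        sorted.process(2, "ROW2");   // fine

        DynPartSinkModel unsorted = new DynPartSinkModel();
        unsorted.process(1, "ROW1"); // DISTRIBUTE BY only: keys may interleave
        unsorted.process(2, "ROW2");
        try {
            unsorted.process(1, "ROW3"); // key 1 returns after its writer was dropped
        } catch (NullPointerException e) {
            System.out.println("NullPointerException, as in the stack trace above");
        }
    }
}
```

This also shows why removing Distribute By or switching to Cluster By avoids the error: both restore the grouped-keys assumption the sink relies on.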
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=495716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495716 ] ASF GitHub Bot logged work on HIVE-24203:

Author: ASF GitHub Bot
Created on: 06/Oct/20 05:00
Start Date: 06/Oct/20 05:00
Worklog Time Spent: 10m

Work Description: okumin commented on a change in pull request #1531:
URL: https://github.com/apache/hive/pull/1531#discussion_r56396

File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java

@@ -2921,6 +2920,97 @@
{code:java}
  /**
   * LateralViewJoinOperator changes the data size and column level statistics.
   *
   * A diagram of LATERAL VIEW.
   *
   *     [Lateral View Forward]
   *          /         \
   *     [Select]     [Select]
   *        |            |
   *        |          [UDTF]
   *         \          /
   *     [Lateral View Join]
   *
   * For each row of the source, the left branch just picks columns and the right branch processes the UDTF.
   * Then LVJ joins a row from the left branch with rows from the right branch.
   * The join has a one-to-many relationship, since a UDTF can generate multiple rows.
   *
   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up both sides.
   */
  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
    @Override
    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
        Object... nodeOutputs) throws SemanticException {
      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
      final HiveConf conf = aspCtx.getConf();

      if (!isAllParentsContainStatistics(lop)) {
        return null;
      }

      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
      if (parents.size() != 2) {
        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
            + parents.size() + " parents.");
        return null;
      }

      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics().clone();
{code}

Review comment: As for `udtfStats`, we can totally avoid clone. As for `udtfStats`, its column stats will be updated. However, it looks like `StatsUtils.getColStatisticsFromExprMap` clones them? Anyway, I think we can remove them if CI passes. I will try it.

Issue Time Tracking
---
Worklog Id: (was: 495716)
Time Spent: 1h (was: 50m)

> Implement stats annotation rule for the LateralViewJoinOperator
>
> Key: HIVE-24203
> URL: https://issues.apache.org/jira/browse/HIVE-24203
> Project: Hive
> Issue Type: Improvement
> Components: Physical Optimizer
> Affects Versions: 4.0.0, 3.1.2, 2.3.7
> Reporter: okumin
> Assignee: okumin
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> StatsRulesProcFactory doesn't have any rules to handle a JOIN by LATERAL VIEW.
> This can cause an underestimation in case the UDTF in LATERAL VIEW generates
> multiple rows.
> HIVE-20262 has already added the rule for UDTF.
> This issue would add the rule for LateralViewJoinOperator.
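The rule described in the Javadoc above — multiply the stats from the left branch by T(right) / T(left) and sum up both sides — amounts to simple arithmetic. Here is a back-of-envelope sketch with made-up numbers; `estimate` is an illustrative helper, not a Hive method.

```java
// Back-of-envelope version of the LateralViewJoinStatsRule arithmetic.
// Inputs are invented; only the formula comes from the Javadoc above.
public class LvjStats {

    // Returns {estimated row count, estimated data size} for the LVJ output.
    static long[] estimate(long selectRows, long selectDataSize,
                           long udtfRows, long udtfDataSize) {
        // Each Select-branch row joins with udtfRows / selectRows UDTF rows,
        // so scale the left branch's data size by T(right) / T(left).
        long scaledSelectDataSize = selectDataSize * udtfRows / selectRows;
        // The join emits one output row per UDTF row; sum both branches' sizes.
        return new long[] { udtfRows, scaledSelectDataSize + udtfDataSize };
    }

    public static void main(String[] args) {
        // 100 source rows, an explode() emitting 3 rows each -> 300 UDTF rows.
        long[] stats = estimate(100, 10_000, 300, 6_000);
        System.out.println(stats[0] + " rows, " + stats[1] + " bytes");
        // prints: 300 rows, 36000 bytes
    }
}
```

Without the rule, the annotator would carry the 100-row Select estimate past the join, which is exactly the underestimation the issue description mentions.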
[jira] [Commented] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208486#comment-17208486 ] Syed Shameerur Rahman commented on HIVE-18284:

[~kgyrtkirk] [~jcamachorodriguez] ping for review request!
[jira] [Issue Comment Deleted] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-18284: - Comment: was deleted (was: [~jcamachorodriguez] Could you please review the PR?) > NPE when inserting data with 'distribute by' clause with dynpart sort > optimization > -- > > Key: HIVE-18284 > URL: https://issues.apache.org/jira/browse/HIVE-18284 > Project: Hive > Issue Type: Bug > Components: Query Processor >Affects Versions: 2.3.1, 2.3.2 >Reporter: Aki Tanaka >Assignee: Syed Shameerur Rahman >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > > A Null Pointer Exception occurs when inserting data with 'distribute by' > clause. The following snippet query reproduces this issue: > *(non-vectorized , non-llap mode)* > {code:java} > create table table1 (col1 string, datekey int); > insert into table1 values ('ROW1', 1), ('ROW2', 2), ('ROW3', 1); > create table table2 (col1 string) partitioned by (datekey int); > set hive.vectorized.execution.enabled=false; > set hive.optimize.sort.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nonstrict; > insert into table table2 > PARTITION(datekey) > select col1, > datekey > from table1 > distribute by datekey ; > {code} > I could run the insert query without the error if I remove Distribute By or > use Cluster By clause. > It seems that the issue happens because Distribute By does not guarantee > clustering or sorting properties on the distributed keys. > FileSinkOperator removes the previous fsp. FileSinkOperator will remove the > previous fsp which might be re-used when we use Distribute By. > https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L972 > The following stack trace is logged. 
> {code:java} > Vertex failed, vertexName=Reducer 2, vertexId=vertex_1513111717879_0056_1_01, > diagnostics=[Task failed, taskId=task_1513111717879_0056_1_01_00, > diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( > failure ) : > attempt_1513111717879_0056_1_01_00_0:java.lang.RuntimeException: > org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while > processing row (tag=0) {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:211) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168) > at > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime > Error while processing row (tag=0) > {"key":{},"value":{"_col0":"ROW3","_col1":1}} > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:365) > at > 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:250) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:317) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185) > ... 14 more > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:762) > at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897) > at > org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:356) > ... 17 more > {code}
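The failure mode in this report can be illustrated with a short, self-contained sketch (plain Java with hypothetical names, not Hive's FileSinkOperator): a sink that closes the previous partition's writer whenever the key changes breaks as soon as a key reappears later in the stream, which Distribute By allows because it hashes rows without sorting them.

```java
import java.util.*;

/**
 * Illustration (not Hive code) of why sorted/clustered keys matter here:
 * a sink that closes the previous partition's writer on every key change
 * fails when the same key shows up again later in the reducer's input.
 */
public class DynPartSinkSketch {
  /** Returns true if some key reappears after its writer was closed. */
  public static boolean reusesClosedWriter(List<Integer> keys) {
    Set<Integer> closed = new HashSet<>();
    Integer current = null;
    for (Integer k : keys) {
      if (!k.equals(current)) {
        if (current != null) {
          closed.add(current);   // close the previous partition's writer
        }
        if (closed.contains(k)) {
          return true;           // writer already closed: would NPE
        }
        current = k;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // distribute by: no ordering guarantee, keys may interleave -> 1, 2, 1
    System.out.println(reusesClosedWriter(Arrays.asList(1, 2, 1)));  // true
    // cluster by: keys arrive grouped -> 1, 1, 2
    System.out.println(reusesClosedWriter(Arrays.asList(1, 1, 2)));  // false
  }
}
```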
[jira] [Issue Comment Deleted] (HIVE-18284) NPE when inserting data with 'distribute by' clause with dynpart sort optimization
[ https://issues.apache.org/jira/browse/HIVE-18284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Syed Shameerur Rahman updated HIVE-18284: - Comment: was deleted (was: [~kgyrtkirk] I have addressed your comments. Please take a look!)
[jira] [Updated] (HIVE-24209) Incorrect search argument conversion for NOT BETWEEN operation when vectorization is enabled
[ https://issues.apache.org/jira/browse/HIVE-24209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated HIVE-24209: Fix Version/s: 4.0.0 Resolution: Fixed Status: Resolved (was: Patch Available) Pushed to master. Thanks, Ganesha! > Incorrect search argument conversion for NOT BETWEEN operation when > vectorization is enabled > > > Key: HIVE-24209 > URL: https://issues.apache.org/jira/browse/HIVE-24209 > Project: Hive > Issue Type: Bug > Components: Vectorization >Reporter: Ganesha Shreedhara >Assignee: Ganesha Shreedhara >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: HIVE-24209.patch > > Time Spent: 10m > Remaining Estimate: 0h > > We skipped adding the GenericUDFOPNot UDF to the filter expression for the NOT BETWEEN > operation when vectorization is enabled, because of the improvement made as > part of HIVE-15884. However, this is not handled during the conversion of the filter > expression to a search argument, so an incorrect predicate gets pushed > down to the storage layer, which leads to incorrect split generation and incorrect > results.
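The effect of the dropped negation can be sketched numerically (illustrative Java with hypothetical names, not Hive's SearchArgument API): if the planner encodes NOT BETWEEN as a BETWEEN node plus a negation flag, a converter that ignores the flag pushes down the complementary predicate, pruning exactly the splits that hold matching rows.

```java
/**
 * Sketch of the HIVE-24209 failure mode. 'rowFilter' is what the row-level
 * filter evaluates; 'pushedDownBuggy' models a search-argument conversion
 * that drops the negation flag.
 */
public class SargSketch {
  /** Predicate as the row-level filter evaluates it. */
  public static boolean rowFilter(int v, int lo, int hi, boolean negated) {
    boolean between = lo <= v && v <= hi;
    return negated ? !between : between;
  }

  /** Buggy conversion: the 'negated' flag is ignored. */
  public static boolean pushedDownBuggy(int v, int lo, int hi, boolean negated) {
    return lo <= v && v <= hi;
  }

  public static void main(String[] args) {
    // v = 15 matches "v NOT BETWEEN 1 AND 10" ...
    System.out.println(rowFilter(15, 1, 10, true));        // true
    // ... but the buggy pushed-down predicate prunes the split holding it,
    // so the row never reaches the row filter: incorrect results.
    System.out.println(pushedDownBuggy(15, 1, 10, true));  // false
  }
}
```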
[jira] [Work logged] (HIVE-24069) HiveHistory should log the task that ends abnormally
[ https://issues.apache.org/jira/browse/HIVE-24069?focusedWorklogId=495713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495713 ] ASF GitHub Bot logged work on HIVE-24069: - Author: ASF GitHub Bot Created on: 06/Oct/20 04:29 Start Date: 06/Oct/20 04:29 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on pull request #1429: URL: https://github.com/apache/hive/pull/1429#issuecomment-704020199 @ashutoshc Could you please take a look? thanks much! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495713) Time Spent: 40m (was: 0.5h) > HiveHistory should log the task that ends abnormally > > > Key: HIVE-24069 > URL: https://issues.apache.org/jira/browse/HIVE-24069 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Reporter: Zhihua Deng >Assignee: Zhihua Deng >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > When the task returns with the exitVal not equal to 0, The Executor would > skip marking the task return code and calling endTask. This may make the > history log incomplete for such tasks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
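One way to sketch the intended behavior (hypothetical names, not the actual HiveHistory API): record the return code and end the task on every exit path, for example in a finally block, so an abnormal exit is logged the same way a normal one is.

```java
import java.util.*;

/**
 * Sketch of the fix idea: the history entries are written for success AND
 * failure, instead of being skipped when exitVal != 0.
 */
public class TaskHistorySketch {
  public static final List<String> log = new ArrayList<>();

  public static int run(String taskId, int exitVal) {
    try {
      return exitVal; // task body elided
    } finally {
      // executed on every exit path, so the history log stays complete
      log.add(taskId + " returnCode=" + exitVal);
      log.add(taskId + " endTask");
    }
  }

  public static void main(String[] args) {
    run("Stage-1", 0);
    run("Stage-2", 9); // a task that ends abnormally is still recorded
    System.out.println(log);
  }
}
```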
[jira] [Resolved] (HIVE-24224) Fix skipping header/footer for Hive on Tez on compressed files
[ https://issues.apache.org/jira/browse/HIVE-24224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan resolved HIVE-24224. - Fix Version/s: 4.0.0 Resolution: Fixed Pushed to master. Thanks, Panos! > Fix skipping header/footer for Hive on Tez on compressed files > -- > > Key: HIVE-24224 > URL: https://issues.apache.org/jira/browse/HIVE-24224 > Project: Hive > Issue Type: Bug >Reporter: Panagiotis Garefalakis >Assignee: Panagiotis Garefalakis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > A compressed file with Hive on Tez returns header and footer rows - for both > select * and select count(*):
> {noformat}
> printf "offset,id,other\n9,\"20200315 X00 1356\",123\n17,\"20200315 X00 1357\",123\nrst,rst,rst" > data.csv
> hdfs dfs -put -f data.csv /apps/hive/warehouse/bz2test/bz2tbl1/
> bzip2 -f data.csv
> hdfs dfs -put -f data.csv.bz2 /apps/hive/warehouse/bz2test/bz2tbl2/
> beeline -e "CREATE EXTERNAL TABLE default.bz2tst2 (
>   sequence int,
>   id string,
>   other string)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> LOCATION '/apps/hive/warehouse/bz2test/bz2tbl2'
> TBLPROPERTIES (
>   'skip.header.line.count'='1',
>   'skip.footer.line.count'='1');"
> beeline -e "
> SET hive.fetch.task.conversion = none;
> SELECT * FROM default.bz2tst2;"
> +-------------------+--------------------+----------------+
> | bz2tst2.sequence  | bz2tst2.id         | bz2tst2.other  |
> +-------------------+--------------------+----------------+
> | offset            | id                 | other          |
> | 9                 | 20200315 X00 1356  | 123            |
> | 17                | 20200315 X00 1357  | 123            |
> | rst               | rst                | rst            |
> +-------------------+--------------------+----------------+
> {noformat}
> PS: HIVE-22769 addressed the issue for Hive on LLAP.
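The intended semantics of skip.header.line.count / skip.footer.line.count for a compressed (single-split) file can be sketched as follows (illustrative Java, not Hive's record reader): the first and last lines of the whole file must be dropped before rows reach the operator tree, which is what the query above failed to do.

```java
import java.util.*;

/**
 * Sketch of header/footer skipping for a file read as a single stream:
 * drop 'header' lines from the front and 'footer' lines from the end.
 */
public class HeaderFooterSketch {
  public static List<String> skip(List<String> lines, int header, int footer) {
    int from = Math.min(header, lines.size());
    // never cross 'from', so over-large footer counts yield an empty result
    int to = Math.max(from, lines.size() - footer);
    return new ArrayList<>(lines.subList(from, to));
  }

  public static void main(String[] args) {
    List<String> file = Arrays.asList(
        "offset,id,other",               // header line
        "9,\"20200315 X00 1356\",123",
        "17,\"20200315 X00 1357\",123",
        "rst,rst,rst");                  // footer line
    System.out.println(skip(file, 1, 1)); // only the two data rows remain
  }
}
```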
[jira] [Updated] (HIVE-24205) Optimise CuckooSetBytes
[ https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated HIVE-24205: Fix Version/s: 4.0.0 Resolution: Fixed Status: Resolved (was: Patch Available) Pushed to master. Thanks, Mustafa! > Optimise CuckooSetBytes > --- > > Key: HIVE-24205 > URL: https://issues.apache.org/jira/browse/HIVE-24205 > Project: Hive > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Mustafa Iman >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png, bench.png, > vectorized.patch > > Time Spent: 10m > Remaining Estimate: 0h > > {{FilterStringColumnInList, StringColumnInList}} etc. use CuckooSetBytes for > lookup. > !Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508! > One option to optimize would be to add boundary conditions on "length" with > the min/max length stored in the hashes (ref: > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/CuckooSetBytes.java#L85]) > . This would significantly reduce the number of hash computations that need > to happen. E.g. > [TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20]
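The proposed boundary condition can be sketched with a simple length-bounded set (illustrative Java, not the real CuckooSetBytes): track the min/max key length at insert time and reject probes outside that range before computing any hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

/**
 * Sketch of the proposed optimization: a membership set that records the
 * min/max key length on insert and short-circuits lookups for lengths that
 * can never be in the set, skipping the hash computation entirely.
 */
public class LengthBoundedSet {
  private final Set<String> set = new HashSet<>();
  private int minLen = Integer.MAX_VALUE;
  private int maxLen = Integer.MIN_VALUE;

  public void add(byte[] key) {
    minLen = Math.min(minLen, key.length);
    maxLen = Math.max(maxLen, key.length);
    set.add(new String(key, StandardCharsets.UTF_8));
  }

  public boolean contains(byte[] key) {
    // Boundary check: a common case for IN-list filters, avoids hashing.
    if (key.length < minLen || key.length > maxLen) {
      return false;
    }
    return set.contains(new String(key, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    LengthBoundedSet s = new LengthBoundedSet();
    s.add("MAIL".getBytes(StandardCharsets.UTF_8));
    s.add("SHIP".getBytes(StandardCharsets.UTF_8));
    // "AIR" has length 3 < minLen 4: rejected before any hashing
    System.out.println(s.contains("AIR".getBytes(StandardCharsets.UTF_8)));  // false
    System.out.println(s.contains("MAIL".getBytes(StandardCharsets.UTF_8))); // true
  }
}
```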
[jira] [Work logged] (HIVE-24224) Fix skipping header/footer for Hive on Tez on compressed files
[ https://issues.apache.org/jira/browse/HIVE-24224?focusedWorklogId=495707&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495707 ] ASF GitHub Bot logged work on HIVE-24224: - Author: ASF GitHub Bot Created on: 06/Oct/20 04:02 Start Date: 06/Oct/20 04:02 Worklog Time Spent: 10m Work Description: ashutoshc commented on pull request #1546: URL: https://github.com/apache/hive/pull/1546#issuecomment-704013804 +1 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495707) Time Spent: 0.5h (was: 20m)
[jira] [Updated] (HIVE-24232) Incorrect translation of rollup expression from Calcite
[ https://issues.apache.org/jira/browse/HIVE-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24232: -- Labels: pull-request-available (was: ) > Incorrect translation of rollup expression from Calcite > --- > > Key: HIVE-24232 > URL: https://issues.apache.org/jira/browse/HIVE-24232 > Project: Hive > Issue Type: Bug > Components: CBO >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In Calcite, the columns in the group set are not necessarily in the > same order as the rollup. For instance, this is the Calcite representation of > a rollup for a given query: > {code} > HiveAggregate(group=[{1, 6, 7}], groups=[[{1, 6, 7}, {1, 7}, {1}, {}]], > agg#0=[sum($12)], agg#1=[count($12)], agg#2=[sum($4)], agg#3=[count($4)], > agg#4=[sum($15)], agg#5=[count($15)]) > {code} > When we generate the Hive plan from the Calcite operator, we incorrectly > assume that they are.
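The mismatch described above can be reproduced with a small sketch (plain Java, not Calcite's API): generating rollup group sets from the rollup key order and then printing them as sorted bitsets shows that the printed order need not match the rollup order.

```java
import java.util.*;

/**
 * Sketch of the observation in HIVE-24232: rollup group sets are produced
 * by dropping keys in rollup order, but each set prints its columns sorted
 * by position, so the printed order is not the rollup order.
 */
public class RollupSetsSketch {
  /** Group sets for ROLLUP over keys given in rollup order. */
  public static List<SortedSet<Integer>> rollupSets(List<Integer> keysInRollupOrder) {
    List<SortedSet<Integer>> sets = new ArrayList<>();
    for (int n = keysInRollupOrder.size(); n >= 0; n--) {
      sets.add(new TreeSet<>(keysInRollupOrder.subList(0, n))); // drop last key first
    }
    return sets;
  }

  public static void main(String[] args) {
    // Rollup order (1, 7, 6) reproduces the groups from the Jira example:
    // [{1, 6, 7}, {1, 7}, {1}, {}] -- note the first set prints as {1, 6, 7},
    // sorted, even though the rollup order is (1, 7, 6).
    System.out.println(rollupSets(Arrays.asList(1, 7, 6)));
  }
}
```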
[jira] [Work logged] (HIVE-24232) Incorrect translation of rollup expression from Calcite
[ https://issues.apache.org/jira/browse/HIVE-24232?focusedWorklogId=495704&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495704 ] ASF GitHub Bot logged work on HIVE-24232: - Author: ASF GitHub Bot Created on: 06/Oct/20 03:34 Start Date: 06/Oct/20 03:34 Worklog Time Spent: 10m Work Description: jcamachor opened a new pull request #1554: URL: https://github.com/apache/hive/pull/1554 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495704) Remaining Estimate: 0h Time Spent: 10m
[jira] [Updated] (HIVE-24232) Incorrect translation of rollup expression from Calcite
[ https://issues.apache.org/jira/browse/HIVE-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesus Camacho Rodriguez updated HIVE-24232: --- Status: Patch Available (was: Open)
[jira] [Commented] (HIVE-24232) Incorrect translation of rollup expression from Calcite
[ https://issues.apache.org/jira/browse/HIVE-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208469#comment-17208469 ] Jesus Camacho Rodriguez commented on HIVE-24232: The PR also adds printing of grouping sets for the Hive Group By operators in the Hive plan.
[jira] [Assigned] (HIVE-24232) Incorrect translation of rollup expression from Calcite
[ https://issues.apache.org/jira/browse/HIVE-24232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesus Camacho Rodriguez reassigned HIVE-24232: --
[jira] [Work logged] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator
[ https://issues.apache.org/jira/browse/HIVE-24203?focusedWorklogId=495700&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495700 ] ASF GitHub Bot logged work on HIVE-24203: - Author: ASF GitHub Bot Created on: 06/Oct/20 03:14 Start Date: 06/Oct/20 03:14 Worklog Time Spent: 10m Work Description: jcamachor commented on a change in pull request #1531: URL: https://github.com/apache/hive/pull/1531#discussion_r497866947
## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
## @@ -2921,6 +2920,97 @@ public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
     }
   }
 
+  /**
+   * LateralViewJoinOperator changes the data size and column level statistics.
+   *
+   * A diagram of LATERAL VIEW.
+   *
+   *   [Lateral View Forward]
+   *        /         \
+   *   [Select]     [Select]
+   *      |            |
+   *      |         [UDTF]
+   *       \         /
+   *   [Lateral View Join]
+   *
+   * For each row of the source, the left branch just picks columns and the right branch processes the UDTF.
+   * LVJ then joins a row from the left branch with rows from the right branch.
+   * The join has a one-to-many relationship since the UDTF can generate multiple rows.
+   *
+   * This rule multiplies the stats from the left branch by T(right) / T(left) and sums up both sides.
+   */
+  public static class LateralViewJoinStatsRule extends DefaultStatsRule implements SemanticNodeProcessor {
+    @Override
+    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx procCtx,
+        Object... nodeOutputs) throws SemanticException {
+      final LateralViewJoinOperator lop = (LateralViewJoinOperator) nd;
+      final AnnotateStatsProcCtx aspCtx = (AnnotateStatsProcCtx) procCtx;
+      final HiveConf conf = aspCtx.getConf();
+
+      if (!isAllParentsContainStatistics(lop)) {
+        return null;
+      }
+
+      final List<Operator<? extends OperatorDesc>> parents = lop.getParentOperators();
+      if (parents.size() != 2) {
+        LOG.warn("LateralViewJoinOperator should have just two parents but actually has "
+            + parents.size() + " parents.");
+        return null;
+      }
+
+      final Statistics selectStats = parents.get(LateralViewJoinOperator.SELECT_TAG).getStatistics().clone();

Review comment: Do you need to clone them? Are you modifying them? (Same for next line)

## File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java
## @@ -2921,6 +2920,97 @@ (same hunk as above, quoted through the line "The join has a one-to-many relationship since the UDTF can generate multiple rows.")

Review comment: Just leaving a note. I took a quick look at the UDTF logic and it seems the selectivity is hardcoded via config. It seems the outer flag is not taken into account either, which could be a straightforward improvement for the estimates, i.e., UDTF will produce at least as many rows as it receives. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495700) Time Spent: 50m (was: 40m) > Implement stats annotation rule for the LateralViewJoinOperator > --- > > Key: HIVE-24203 > URL: https://issues.apache.org/jira/browse/HIVE-24203 > Project: Hive > Issue Type: Improvement > Components: Physical Optimizer >Affects Versions: 4.0.0, 3.1.2, 2.3.7 >Reporter: okumin >Assignee: okumin >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > StatsRulesProcFactory doesn't have any rules to handle a JOIN by LATERAL VIEW. > This can cause an underestimation when the UDTF in LATERAL VIEW generates > multiple rows. > HIVE-20262 has already added the rule for UDTF. > This issue would add the rule for LateralViewJoinOperator.
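The formula stated in the rule's Javadoc can be sketched numerically (hypothetical helper, not the StatsRulesProcFactory API): scale the left (select) branch by T(right)/T(left) and add the two branches.

```java
/**
 * Numeric sketch of the lateral view join stats rule: each source row
 * joins with rightRows/leftRows UDTF rows on average, so the left branch's
 * data size is scaled by that ratio before both sides are summed.
 */
public class LvjStatsSketch {
  /** Estimated output data size of the lateral view join. */
  public static long joinDataSize(long leftRows, long leftSize,
                                  long rightRows, long rightSize) {
    double ratio = (double) rightRows / leftRows; // T(right) / T(left)
    return Math.round(leftSize * ratio) + rightSize;
  }

  public static void main(String[] args) {
    // 100 source rows totalling 100 KB; explode() emits 300 rows / 60 KB:
    // scaled left = 300 KB, plus right = 60 KB.
    System.out.println(joinDataSize(100, 100_000, 300, 60_000)); // 360000
  }
}
```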
[jira] [Work logged] (HIVE-24202) Clean up local HS2 HMS cache code (II)
[ https://issues.apache.org/jira/browse/HIVE-24202?focusedWorklogId=495682&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495682 ] ASF GitHub Bot logged work on HIVE-24202: - Author: ASF GitHub Bot Created on: 06/Oct/20 01:47 Start Date: 06/Oct/20 01:47 Worklog Time Spent: 10m Work Description: jcamachor commented on pull request #1543: URL: https://github.com/apache/hive/pull/1543#issuecomment-703980374 @vineetgarg02 , could you take a look? Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495682) Time Spent: 20m (was: 10m) > Clean up local HS2 HMS cache code (II) > -- > > Key: HIVE-24202 > URL: https://issues.apache.org/jira/browse/HIVE-24202 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Follow-up for HIVE-24183 (split into different JIRAs). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23712) metadata-only queries return incorrect results with empty acid partition
[ https://issues.apache.org/jira/browse/HIVE-23712?focusedWorklogId=495661&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495661 ] ASF GitHub Bot logged work on HIVE-23712: - Author: ASF GitHub Bot Created on: 06/Oct/20 00:52 Start Date: 06/Oct/20 00:52 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on pull request #1182: URL: https://github.com/apache/hive/pull/1182#issuecomment-703966215 This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the d...@hive.apache.org list if the patch is in need of reviews. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495661) Time Spent: 20m (was: 10m) > metadata-only queries return incorrect results with empty acid partition > > > Key: HIVE-23712 > URL: https://issues.apache.org/jira/browse/HIVE-23712 > Project: Hive > Issue Type: Bug >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Similarly to HIVE-15397, queries can return incorrect results for > metadata-only queries; here is a repro scenario which affects master:
> {code}
> set hive.support.concurrency=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.optimize.metadataonly=true;
> create table test1 (id int, val string) partitioned by (val2 string) STORED AS ORC TBLPROPERTIES ('transactional'='true');
> describe formatted test1;
> alter table test1 add partition (val2='foo');
> alter table test1 add partition (val2='bar');
> insert into test1 partition (val2='foo') values (1, 'abc');
> select distinct val2, current_timestamp from test1;
> insert into test1 partition (val2='bar') values (1, 'def');
> delete from test1 where val2 = 'bar';
> select '--> hive.optimize.metadataonly=true';
> select distinct val2, current_timestamp from test1;
> set hive.optimize.metadataonly=false;
> select '--> hive.optimize.metadataonly=false';
> select distinct val2, current_timestamp from test1;
> select current_timestamp, * from test1;
> {code}
> In this case, 2 rows are returned instead of 1 after a delete with the metadata-only
> optimization:
> https://github.com/abstractdog/hive/commit/a7f03513564d01f7c3ba4aa61c4c6537100b4d3f#diff-cb23043000831f41fe7041cb38f82224R114-R128
[jira] [Work logged] (HIVE-23757) Pushing TopN Key operator through MAPJOIN
[ https://issues.apache.org/jira/browse/HIVE-23757?focusedWorklogId=495662&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495662 ] ASF GitHub Bot logged work on HIVE-23757: - Author: ASF GitHub Bot Created on: 06/Oct/20 00:52 Start Date: 06/Oct/20 00:52 Worklog Time Spent: 10m Work Description: github-actions[bot] closed pull request #1181: URL: https://github.com/apache/hive/pull/1181 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495662) Time Spent: 40m (was: 0.5h) > Pushing TopN Key operator through MAPJOIN > - > > Key: HIVE-23757 > URL: https://issues.apache.org/jira/browse/HIVE-23757 > Project: Hive > Issue Type: Improvement >Reporter: Attila Magyar >Assignee: Attila Magyar >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > So far only MERGEJOIN + JOIN cases are handled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (HIVE-24209) Incorrect search argument conversion for NOT BETWEEN operation when vectorization is enabled
[ https://issues.apache.org/jira/browse/HIVE-24209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ganesha Shreedhara updated HIVE-24209: -- Comment: was deleted (was: [~ashutoshc] Thanks for reviewing. Please help with pushing this fix to master. )
[jira] [Work logged] (HIVE-24120) Plugin for external DatabaseProduct in standalone HMS
[ https://issues.apache.org/jira/browse/HIVE-24120?focusedWorklogId=495617&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495617 ] ASF GitHub Bot logged work on HIVE-24120: - Author: ASF GitHub Bot Created on: 05/Oct/20 22:22 Start Date: 05/Oct/20 22:22 Worklog Time Spent: 10m Work Description: gatorblue commented on a change in pull request #1470: URL: https://github.com/apache/hive/pull/1470#discussion_r499904277 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DatabaseProduct.java ## @@ -20,71 +20,666 @@ import java.sql.SQLException; import java.sql.SQLTransactionRollbackException; +import java.sql.Timestamp; +import java.util.ArrayList; +import java.util.EnumMap; +import java.util.HashMap; +import java.util.List; +import java.util.Map; -/** Database product infered via JDBC. */ -public enum DatabaseProduct { - DERBY, MYSQL, POSTGRES, ORACLE, SQLSERVER, OTHER; +import org.apache.hadoop.conf.Configurable; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.api.MetaException; +import org.apache.hadoop.hive.metastore.conf.MetastoreConf; +import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars; +import org.apache.hadoop.util.ReflectionUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import com.google.common.base.Preconditions; + +/** Database product inferred via JDBC. Encapsulates all SQL logic associated with + * the database product. + * This class is a singleton, which is instantiated the first time + * method determineDatabaseProduct is invoked. 
+ * Tests that need to create multiple instances can use the reset method + * */ +public class DatabaseProduct implements Configurable { + static final private Logger LOG = LoggerFactory.getLogger(DatabaseProduct.class.getName()); + + private static enum DbType {DERBY, MYSQL, POSTGRES, ORACLE, SQLSERVER, CUSTOM, UNDEFINED}; + public DbType dbType; + + // Singleton instance + private static DatabaseProduct theDatabaseProduct; + + Configuration myConf; + /** + * Protected constructor for singleton class + * @param id + */ + protected DatabaseProduct() {} + + public static final String DERBY_NAME = "derby"; + public static final String SQL_SERVER_NAME = "microsoft sql server"; + public static final String MYSQL_NAME = "mysql"; + public static final String POSTGRESQL_NAME = "postgresql"; + public static final String ORACLE_NAME = "oracle"; + public static final String UNDEFINED_NAME = "other"; + /** * Determine the database product type * @param productName string to defer database connection * @return database product type */ - public static DatabaseProduct determineDatabaseProduct(String productName) throws SQLException { -if (productName == null) { - return OTHER; + public static DatabaseProduct determineDatabaseProduct(String productName, Configuration c) { +DbType dbt; + +if (theDatabaseProduct != null) { + Preconditions.checkState(theDatabaseProduct.dbType == getDbType(productName)); + return theDatabaseProduct; } + +// This method may be invoked by concurrent connections +synchronized (DatabaseProduct.class) { + + if (productName == null) { +productName = UNDEFINED_NAME; + } + + dbt = getDbType(productName); + + // Check for null again in case of race condition + if (theDatabaseProduct == null) { +final Configuration conf = c!= null ? 
c : MetastoreConf.newMetastoreConf(); +// Check if we are using an external database product +boolean isExternal = MetastoreConf.getBoolVar(conf, ConfVars.USE_CUSTOM_RDBMS); + +if (isExternal) { + // The DatabaseProduct will be created by instantiating an external class via + // reflection. The external class can override any method in the current class + String className = MetastoreConf.getVar(conf, ConfVars.CUSTOM_RDBMS_CLASSNAME); + + if (className != null) { +try { + theDatabaseProduct = (DatabaseProduct) + ReflectionUtils.newInstance(Class.forName(className), conf); + + LOG.info(String.format("Using custom RDBMS %s. Overriding DbType: %s", className, dbt)); Review comment: Yeah, I put this for my own unit testing. Removed it now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495617) Time Spent: 1h 40m (was: 1.5h) > Plugin for external Datab
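The `determineDatabaseProduct` diff above follows the classic lazy-singleton shape: an unsynchronized fast path, then a synchronized block that re-checks for null so concurrent first callers create only one instance. A minimal standalone sketch of that pattern (class and field names are illustrative, not Hive's; note this sketch also marks the instance field `volatile`, which the safe double-checked-locking idiom requires and which is worth checking for in the reviewed code):

```java
// Minimal sketch of the lazy-singleton pattern used in the patch under review.
// Names are hypothetical; only the control flow mirrors determineDatabaseProduct.
public class LazySingleton {
    // volatile is required for safe double-checked locking in Java.
    private static volatile LazySingleton instance;
    private final String productName;

    private LazySingleton(String productName) {
        this.productName = productName;
    }

    public static LazySingleton determine(String productName) {
        if (instance != null) {       // fast path: no lock once initialized
            return instance;
        }
        synchronized (LazySingleton.class) {
            if (instance == null) {   // re-check: another thread may have won the race
                instance = new LazySingleton(productName == null ? "other" : productName);
            }
        }
        return instance;
    }

    public String getProductName() {
        return productName;
    }

    // Test hook, mirroring the "reset method" mentioned in the class javadoc.
    static void reset() {
        instance = null;
    }
}
```

Repeated calls return the same object, and `reset()` exists only so tests can force re-initialization, as the javadoc in the patch describes.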
[jira] [Work logged] (HIVE-19253) HMS ignores tableType property for external tables
[ https://issues.apache.org/jira/browse/HIVE-19253?focusedWorklogId=495600&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495600 ] ASF GitHub Bot logged work on HIVE-19253: - Author: ASF GitHub Bot Created on: 05/Oct/20 21:49 Start Date: 05/Oct/20 21:49 Worklog Time Spent: 10m Work Description: vihangk1 commented on a change in pull request #1537: URL: https://github.com/apache/hive/pull/1537#discussion_r499890809 ## File path: standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestObjectStore.java ## @@ -931,6 +924,64 @@ public void testNotificationOps() throws InterruptedException, MetaException { Assert.assertEquals(0, eventResponse.getEventsSize()); } + /** + * Verify that table type is set correctly based on input table properties. + * Two things are verified: + * + * When EXTERNAL property is set to true, table type should be external + * When table type is set to external it should remain external + * + * @throws Exception + */ + @Test + public void testExternalTable() throws Exception { +Database db1 = new DatabaseBuilder() +.setName(DB1) +.setDescription("description") +.setLocation("locationurl") +.build(conf); +objectStore.createDatabase(db1); + +List tables = new ArrayList<>(4); +Map expectedValues = new HashMap<>(); + +int i = 1; +// Case 1: EXTERNAL = true, tableType == MANAGED_TABLE +// The result should be external table +Table tbl1 = buildTable(conf, db1, "t" + i++, true, null); +tables.add(tbl1); +expectedValues.put(tbl1.getTableName(), true); +// Case 2: EXTERNAL = false, tableType == EXTERNAL_TABLE +// The result should be external table +Table tbl2 = buildTable(conf, db1, "t" + i++, false, TableType.EXTERNAL_TABLE.name()); +tables.add(tbl2); +expectedValues.put(tbl2.getTableName(), true); +// Case 3: EXTERNAL = false, tableType == EXTERNAL_TABLE Review comment: the comment should state EXTERNAL = true ## File path: 
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestObjectStore.java ## @@ -84,15 +85,7 @@ import java.sql.ResultSet; import java.sql.SQLException; import java.sql.Statement; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.HashMap; -import java.util.HashSet; -import java.util.LinkedList; -import java.util.List; -import java.util.Random; -import java.util.Set; -import java.util.UUID; +import java.util.*; Review comment: this change can be reverted since we don't use wildcard imports. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495600) Time Spent: 20m (was: 10m) > HMS ignores tableType property for external tables > -- > > Key: HIVE-19253 > URL: https://issues.apache.org/jira/browse/HIVE-19253 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 3.0.0, 3.1.0, 4.0.0 >Reporter: Alex Kolbasov >Assignee: Vihang Karajgaonkar >Priority: Major > Labels: newbie, pull-request-available > Attachments: HIVE-19253.01.patch, HIVE-19253.02.patch, > HIVE-19253.03.patch, HIVE-19253.03.patch, HIVE-19253.04.patch, > HIVE-19253.05.patch, HIVE-19253.06.patch, HIVE-19253.07.patch, > HIVE-19253.08.patch, HIVE-19253.09.patch, HIVE-19253.10.patch, > HIVE-19253.11.patch, HIVE-19253.12.patch > > Time Spent: 20m > Remaining Estimate: 0h > > When someone creates a table using Thrift API they may think that setting > tableType to {{EXTERNAL_TABLE}} creates an external table. And boom - their > table is gone later because HMS will silently change it to managed table. > here is the offending code: > {code:java} > private MTable convertToMTable(Table tbl) throws InvalidObjectException, > MetaException { > ... 
> // If the table has property EXTERNAL set, update table type
> // accordingly
> String tableType = tbl.getTableType();
> boolean isExternal =
>     Boolean.parseBoolean(tbl.getParameters().get("EXTERNAL"));
> if (TableType.MANAGED_TABLE.toString().equals(tableType)) {
>   if (isExternal) {
>     tableType = TableType.EXTERNAL_TABLE.toString();
>   }
> }
> if (TableType.EXTERNAL_TABLE.toString().equals(tableType)) {
>   if (!isExternal) { // Here! >     tableType = TableType.MANAGED
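The test cases quoted above pin down the intended resolution rule: `EXTERNAL=true` forces an external table, and a table declared `EXTERNAL_TABLE` must stay external rather than being silently demoted to managed. A small hypothetical helper (not Hive's `convertToMTable`; names and strings are illustrative) capturing that rule:

```java
// Illustrative sketch of the table-type resolution rule HIVE-19253 expects.
// This is NOT Hive's convertToMTable -- a standalone helper for clarity.
import java.util.Map;

public class TableTypeResolver {
    static final String MANAGED = "MANAGED_TABLE";
    static final String EXTERNAL = "EXTERNAL_TABLE";

    static String resolve(String declaredType, Map<String, String> params) {
        boolean isExternal = Boolean.parseBoolean(params.get("EXTERNAL"));
        if (MANAGED.equals(declaredType) && isExternal) {
            return EXTERNAL;  // EXTERNAL=true overrides a managed declaration
        }
        if (EXTERNAL.equals(declaredType)) {
            return EXTERNAL;  // never silently demote to managed (the reported bug)
        }
        return declaredType;
    }
}
```

The second branch is the fix: the offending code in the issue description flips `EXTERNAL_TABLE` back to managed when the `EXTERNAL` property is absent or false.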
[jira] [Work logged] (HIVE-24120) Plugin for external DatabaseProduct in standalone HMS
[ https://issues.apache.org/jira/browse/HIVE-24120?focusedWorklogId=495585&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495585 ] ASF GitHub Bot logged work on HIVE-24120: - Author: ASF GitHub Bot Created on: 05/Oct/20 21:18 Start Date: 05/Oct/20 21:18 Worklog Time Spent: 10m Work Description: vihangk1 commented on a change in pull request #1470: URL: https://github.com/apache/hive/pull/1470#discussion_r499869697 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DatabaseProduct.java ## @@ -20,71 +20,666 @@ import java.sql.SQLException; import java.sql.SQLTransactionRollbackException; +import java.sql.Timestamp; +import java.util.ArrayList; +import java.util.EnumMap; +import java.util.HashMap; +import java.util.List; +import java.util.Map; -/** Database product infered via JDBC. */ -public enum DatabaseProduct { - DERBY, MYSQL, POSTGRES, ORACLE, SQLSERVER, OTHER; +import org.apache.hadoop.conf.Configurable; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.api.MetaException; +import org.apache.hadoop.hive.metastore.conf.MetastoreConf; +import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars; +import org.apache.hadoop.util.ReflectionUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import com.google.common.base.Preconditions; + +/** Database product inferred via JDBC. Encapsulates all SQL logic associated with + * the database product. + * This class is a singleton, which is instantiated the first time + * method determineDatabaseProduct is invoked. 
+ * Tests that need to create multiple instances can use the reset method + * */ +public class DatabaseProduct implements Configurable { + static final private Logger LOG = LoggerFactory.getLogger(DatabaseProduct.class.getName()); + + private static enum DbType {DERBY, MYSQL, POSTGRES, ORACLE, SQLSERVER, CUSTOM, UNDEFINED}; + public DbType dbType; + + // Singleton instance + private static DatabaseProduct theDatabaseProduct; + + Configuration myConf; + /** + * Protected constructor for singleton class + * @param id + */ + protected DatabaseProduct() {} + + public static final String DERBY_NAME = "derby"; + public static final String SQL_SERVER_NAME = "microsoft sql server"; + public static final String MYSQL_NAME = "mysql"; + public static final String POSTGRESQL_NAME = "postgresql"; + public static final String ORACLE_NAME = "oracle"; + public static final String UNDEFINED_NAME = "other"; + /** * Determine the database product type * @param productName string to defer database connection * @return database product type */ - public static DatabaseProduct determineDatabaseProduct(String productName) throws SQLException { -if (productName == null) { - return OTHER; + public static DatabaseProduct determineDatabaseProduct(String productName, Configuration c) { +DbType dbt; + +if (theDatabaseProduct != null) { + Preconditions.checkState(theDatabaseProduct.dbType == getDbType(productName)); + return theDatabaseProduct; } + +// This method may be invoked by concurrent connections +synchronized (DatabaseProduct.class) { + + if (productName == null) { +productName = UNDEFINED_NAME; + } + + dbt = getDbType(productName); + + // Check for null again in case of race condition + if (theDatabaseProduct == null) { +final Configuration conf = c!= null ? 
c : MetastoreConf.newMetastoreConf(); +// Check if we are using an external database product +boolean isExternal = MetastoreConf.getBoolVar(conf, ConfVars.USE_CUSTOM_RDBMS); + +if (isExternal) { + // The DatabaseProduct will be created by instantiating an external class via + // reflection. The external class can override any method in the current class + String className = MetastoreConf.getVar(conf, ConfVars.CUSTOM_RDBMS_CLASSNAME); + + if (className != null) { +try { + theDatabaseProduct = (DatabaseProduct) + ReflectionUtils.newInstance(Class.forName(className), conf); + + LOG.info(String.format("Using custom RDBMS %s. Overriding DbType: %s", className, dbt)); + dbt = DbType.CUSTOM; +}catch (Exception e) { + LOG.warn("Caught exception instantiating custom database product. Reverting to " + dbt, e); +} + } + else { Review comment: nit, the else could go in the same line as 113 as per the coding conventions. ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DatabaseProduct.java ## @@ -20,71 +20,666 @@ import java.sql.SQLException; import java.sql.SQLT
[jira] [Work logged] (HIVE-24120) Plugin for external DatabaseProduct in standalone HMS
[ https://issues.apache.org/jira/browse/HIVE-24120?focusedWorklogId=495584&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495584 ] ASF GitHub Bot logged work on HIVE-24120: - Author: ASF GitHub Bot Created on: 05/Oct/20 21:12 Start Date: 05/Oct/20 21:12 Worklog Time Spent: 10m Work Description: vihangk1 commented on a change in pull request #1470: URL: https://github.com/apache/hive/pull/1470#discussion_r499869127 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DatabaseProduct.java ## @@ -20,71 +20,666 @@ import java.sql.SQLException; import java.sql.SQLTransactionRollbackException; +import java.sql.Timestamp; +import java.util.ArrayList; +import java.util.EnumMap; +import java.util.HashMap; +import java.util.List; +import java.util.Map; -/** Database product infered via JDBC. */ -public enum DatabaseProduct { - DERBY, MYSQL, POSTGRES, ORACLE, SQLSERVER, OTHER; +import org.apache.hadoop.conf.Configurable; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.metastore.api.MetaException; +import org.apache.hadoop.hive.metastore.conf.MetastoreConf; +import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars; +import org.apache.hadoop.util.ReflectionUtils; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; +import com.google.common.base.Preconditions; + +/** Database product inferred via JDBC. Encapsulates all SQL logic associated with + * the database product. + * This class is a singleton, which is instantiated the first time + * method determineDatabaseProduct is invoked. 
+ * Tests that need to create multiple instances can use the reset method + * */ +public class DatabaseProduct implements Configurable { + static final private Logger LOG = LoggerFactory.getLogger(DatabaseProduct.class.getName()); + + private static enum DbType {DERBY, MYSQL, POSTGRES, ORACLE, SQLSERVER, CUSTOM, UNDEFINED}; + public DbType dbType; + + // Singleton instance + private static DatabaseProduct theDatabaseProduct; + + Configuration myConf; + /** + * Protected constructor for singleton class + * @param id + */ + protected DatabaseProduct() {} + + public static final String DERBY_NAME = "derby"; + public static final String SQL_SERVER_NAME = "microsoft sql server"; + public static final String MYSQL_NAME = "mysql"; + public static final String POSTGRESQL_NAME = "postgresql"; + public static final String ORACLE_NAME = "oracle"; + public static final String UNDEFINED_NAME = "other"; + /** * Determine the database product type * @param productName string to defer database connection * @return database product type */ - public static DatabaseProduct determineDatabaseProduct(String productName) throws SQLException { -if (productName == null) { - return OTHER; + public static DatabaseProduct determineDatabaseProduct(String productName, Configuration c) { +DbType dbt; + +if (theDatabaseProduct != null) { + Preconditions.checkState(theDatabaseProduct.dbType == getDbType(productName)); + return theDatabaseProduct; } + +// This method may be invoked by concurrent connections +synchronized (DatabaseProduct.class) { + + if (productName == null) { +productName = UNDEFINED_NAME; + } + + dbt = getDbType(productName); + + // Check for null again in case of race condition + if (theDatabaseProduct == null) { +final Configuration conf = c!= null ? 
c : MetastoreConf.newMetastoreConf(); +// Check if we are using an external database product +boolean isExternal = MetastoreConf.getBoolVar(conf, ConfVars.USE_CUSTOM_RDBMS); + +if (isExternal) { + // The DatabaseProduct will be created by instantiating an external class via + // reflection. The external class can override any method in the current class + String className = MetastoreConf.getVar(conf, ConfVars.CUSTOM_RDBMS_CLASSNAME); + + if (className != null) { +try { + theDatabaseProduct = (DatabaseProduct) + ReflectionUtils.newInstance(Class.forName(className), conf); + + LOG.info(String.format("Using custom RDBMS %s. Overriding DbType: %s", className, dbt)); Review comment: The Overriding DbType: is bit confusing. Why is that log useful? ## File path: standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java ## @@ -1337,6 +1337,15 @@ public static ConfVars getMetaConf(String name) { HIVE_TXN_STATS_ENABLED("hive.txn.stats.enabled", "hive.txn.stats.enabled", true, "Whether Hive supports transactional stats (accurate stats for transactional tables)"), +// External RDBMS support +USE_CUSTOM_RDBMS
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495528&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495528 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 19:39 Start Date: 05/Oct/20 19:39 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499827644 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java ## @@ -237,6 +237,7 @@ void run(HiveConf conf, String jobName, Table t, Partition p, StorageDescriptor } JobConf job = createBaseJobConf(conf, jobName, t, sd, writeIds, ci); +QueryCompactor.Util.removeAbortedDirsForAcidTable(conf, dir); Review comment: removed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495528) Time Spent: 5.5h (was: 5h 20m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 5.5h > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and 
no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495463&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495463 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 17:21 Start Date: 05/Oct/20 17:21 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499754946 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { + return true; +} else if (name.startsWith(baseDir(wId)) && !name.contains("=")) { + return true; +} + } + return false; +}; +List deleted = new ArrayList<>(); +deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted); +return deleted; + } + + private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List deleted) + throws IOException { +RemoteIterator it = listIterator(fs, root, null); + +while (it.hasNext()) { + FileStatus fStatus = it.next(); + if (fStatus.isDirectory()) { +if (filter.accept(fStatus.getPath())) { + fs.delete(fStatus.getPath(), true); + deleted.add(fStatus); +} else { + 
deleteDeltaDirectoriesAux(fStatus.getPath(), fs, filter, deleted); + if (isDirectoryEmpty(fs, fStatus.getPath())) { +fs.delete(fStatus.getPath(), false); +deleted.add(fStatus); + } +} + } +} + } + + private static boolean isDirectoryEmpty(FileSystem fs, Path path) throws IOException { +RemoteIterator it = listIterator(fs, path, null); +return !it.hasNext(); + } + + private static RemoteIterator listIterator(FileSystem fs, Path path, PathFilter filter) + throws IOException { +try { + return new ToFileStatusIterator(SHIMS.listLocatedHdfsStatusIterator(fs, path, filter)); +} catch (Throwable t) { Review comment: removed it This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495463) Time Spent: 5h 20m (was: 5h 10m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 5h 20m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that >
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495460&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495460 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 17:20 Start Date: 05/Oct/20 17:20 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499754142 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { + return true; +} else if (name.startsWith(baseDir(wId)) && !name.contains("=")) { + return true; +} + } + return false; +}; +List deleted = new ArrayList<>(); +deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted); +return deleted; + } + + private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List deleted) Review comment: changed to use getHdfsDirSnapshots, @pvargacl do you know. if i should access cached data somehow? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495460) Time Spent: 5h (was: 4h 50m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 5h > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495461&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495461 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 17:20 Start Date: 05/Oct/20 17:20 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499754423 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { Review comment: changed, also excluded base directory from listing This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495461) Time Spent: 5h 10m (was: 5h) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 5h 10m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
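The `PathFilter` in the `deleteDeltaDirectories` patch discussed above matches delta/base directories for aborted writeIds by name prefix while skipping partition directories (whose names contain `=`). A simplified standalone sketch of that matching logic (real Hive zero-pads writeIds in directory names; plain numbers are used here for readability, so this is an approximation of the patch, not its actual code):

```java
// Simplified sketch of the aborted-writeId directory filter from the patch.
// Assumes unpadded names like "delta_5_5"; Hive's real names are zero-padded.
import java.util.Set;

public class AbortedDirFilter {
    static boolean accept(String dirName, Set<Long> abortedWriteIds) {
        if (dirName.contains("=")) {
            return false;  // partition directory like "ds=2020-10-05": never delete
        }
        for (long wId : abortedWriteIds) {
            // delta_<wid>_<wid> covers single-transaction deltas; base_<wid> covers bases
            if (dirName.startsWith("delta_" + wId + "_" + wId)
                    || dirName.startsWith("base_" + wId)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that prefix matching alone is ambiguous (e.g. a writeId of 5 would also prefix-match a hypothetical `delta_5_50`), which is one reason the reviewers pushed the implementation toward `getHdfsDirSnapshots` instead of raw name filtering.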
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495459&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495459 ]

ASF GitHub Bot logged work on HIVE-21052:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 17:12
Start Date: 05/Oct/20 17:12
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1548:
URL: https://github.com/apache/hive/pull/1548#discussion_r499749748

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ##

@@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map<String, String> tblProps) {
     tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES);
   }

+  /**
+   * Look for delta directories matching the list of writeIds and deletes them.
+   * @param rootPartition root partition to look for the delta directories
+   * @param conf configuration
+   * @param writeIds list of writeIds to look for in the delta directories
+   * @return list of deleted directories.
+   * @throws IOException
+   */
+  public static List<FileStatus> deleteDeltaDirectories(Path rootPartition, Configuration conf, Set<Long> writeIds)
+      throws IOException {
+    FileSystem fs = rootPartition.getFileSystem(conf);
+
+    PathFilter filter = (p) -> {
+      String name = p.getName();
+      for (Long wId : writeIds) {
+        if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) {
+          return true;
+        } else if (name.startsWith(baseDir(wId)) && !name.contains("=")) {
+          return true;
+        }
+      }
+      return false;
+    };
+    List<FileStatus> deleted = new ArrayList<>();
+    deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted);
+    return deleted;
+  }
+
+  private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List<FileStatus> deleted)
+      throws IOException {
+    RemoteIterator<FileStatus> it = listIterator(fs, root, null);
+
+    while (it.hasNext()) {
+      FileStatus fStatus = it.next();
+      if (fStatus.isDirectory()) {
+        if (filter.accept(fStatus.getPath())) {
+          fs.delete(fStatus.getPath(), true);
+          deleted.add(fStatus);
+        } else {
+          deleteDeltaDirectoriesAux(fStatus.getPath(), fs, filter, deleted);
+          if (isDirectoryEmpty(fs, fStatus.getPath())) {

Review comment: partitions are not removed in HMS

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 495459)
Time Spent: 4h 50m (was: 4h 40m)

> Make sure transactions get cleaned if they are aborted before addPartitions
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Jaume M
> Assignee: Jaume M
> Priority: Critical
> Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch,
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch,
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch,
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch,
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has
> been written to the table, the transaction manager will think it's an empty
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables.
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and,
> when addPartitions is called, remove this entry from TXN_COMPONENTS and add
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that
> specifies that a transaction was opened and it was aborted, it must generate
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
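The PathFilter quoted in the review above accepts directory names that look like a delta or base directory for one of the aborted writeIds, and rejects anything containing "=" (partition-style names such as ds=2020). A minimal, self-contained sketch of that matching rule — the zero-padded delta_/base_ name format and the two helper methods are assumptions modeled on Hive's usual ACID directory naming, not the actual AcidUtils implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DeltaFilterSketch {
    // Stand-in for AcidUtils.deltaSubdir(min, max); assumes 7-digit zero padding.
    static String deltaSubdir(long min, long max) {
        return String.format("delta_%07d_%07d", min, max);
    }

    // Stand-in for AcidUtils.baseDir(writeId); same padding assumption.
    static String baseDir(long writeId) {
        return String.format("base_%07d", writeId);
    }

    // Mirrors the PathFilter in the diff: accept delta/base dirs for the given
    // writeIds; names containing '=' (partition directories) never match.
    static boolean matches(String name, Set<Long> writeIds) {
        for (Long wId : writeIds) {
            if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) {
                return true;
            } else if (name.startsWith(baseDir(wId)) && !name.contains("=")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Long> abortedWriteIds = Set.of(5L, 7L);
        List<String> dirs = Arrays.asList(
                "delta_0000005_0000005",  // matches writeId 5 -> would be deleted
                "delta_0000006_0000006",  // writeId 6 not aborted -> kept
                "base_0000007",           // matches writeId 7 -> would be deleted
                "ds=2020");               // partition dir, contains '=' -> kept
        List<String> toDelete = dirs.stream()
                .filter(d -> matches(d, abortedWriteIds))
                .collect(Collectors.toList());
        System.out.println(toDelete);
    }
}
```

The "=" exclusion is what lets the recursive walk descend through partition directories (whose names contain key=value) while only ever deleting the leaf delta/base directories themselves.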
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495457&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495457 ]

ASF GitHub Bot logged work on HIVE-21052:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 17:08
Start Date: 05/Oct/20 17:08
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1548:
URL: https://github.com/apache/hive/pull/1548#discussion_r499747334

## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/CompactionTxnHandler.java ##

@@ -107,11 +107,12 @@ public CompactionTxnHandler() {
     // Check for aborted txns: number of aborted txns past threshold and age of aborted txns
     // past time threshold
     boolean checkAbortedTimeThreshold = abortedTimeThreshold >= 0;
-    final String sCheckAborted = "SELECT \"TC_DATABASE\", \"TC_TABLE\", \"TC_PARTITION\","
-        + "MIN(\"TXN_STARTED\"), COUNT(*)"
+    String sCheckAborted = "SELECT \"TC_DATABASE\", \"TC_TABLE\", \"TC_PARTITION\", "
+        + "MIN(\"TXN_STARTED\"), COUNT(*), "
+        + "MAX(CASE WHEN \"TC_OPERATION_TYPE\" = " + OperationType.DYNPART + " THEN 1 ELSE 0 END) AS \"IS_DP\" "

Review comment: why is that? aborted dynPart is just a special case that would be handled separately (IS_DP=1).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 495457)
Time Spent: 4h 40m (was: 4.5h)

> Make sure transactions get cleaned if they are aborted before addPartitions
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Jaume M
> Assignee: Jaume M
> Priority: Critical
> Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch,
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch,
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch,
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch,
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has
> been written to the table, the transaction manager will think it's an empty
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables.
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and,
> when addPartitions is called, remove this entry from TXN_COMPONENTS and add
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that
> specifies that a transaction was opened and it was aborted, it must generate
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
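The IS_DP column added to the aborted-transaction query above flags, per (database, table, partition) group, whether any of the group's TXN_COMPONENTS rows is a dynamic-partition marker: MAX(CASE WHEN ... THEN 1 ELSE 0 END) is 1 as soon as one such row exists. A rough in-memory sketch of that aggregation's semantics — the record shape and the 'p' operation-type code are illustrative assumptions, not the actual metastore schema:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AbortedTxnAggregation {
    // Simplified stand-in for a TXN_COMPONENTS row; field names are illustrative.
    record TxnComponent(String db, String table, String partition, char opType) {}

    // Hypothetical operation-type code for dynamic-partition marker entries.
    static final char DYNPART = 'p';

    // Mirrors MAX(CASE WHEN TC_OPERATION_TYPE = DYNPART THEN 1 ELSE 0 END) AS IS_DP
    // grouped by (db, table, partition): a group is flagged if ANY row is a marker.
    static Map<String, Boolean> isDpByGroup(List<TxnComponent> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> r.db() + "." + r.table() + "/" + r.partition(),
                Collectors.reducing(false, r -> r.opType() == DYNPART, Boolean::logicalOr)));
    }

    public static void main(String[] args) {
        List<TxnComponent> rows = List.of(
                new TxnComponent("db", "t1", "-", 'i'),       // plain insert
                new TxnComponent("db", "t1", "-", DYNPART),   // marker -> group flagged
                new TxnComponent("db", "t2", "ds=1", 'i'));   // no marker -> not flagged
        System.out.println(isDpByGroup(rows));
    }
}
```

This shows why the review comment treats aborted dynamic-partition transactions as "just a special case": the same grouped scan serves both kinds, and rows with IS_DP=1 can simply be routed to separate handling.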
[jira] [Work logged] (HIVE-24231) Enhance shared work optimizer to merge scans with semijoin filters on both sides
[ https://issues.apache.org/jira/browse/HIVE-24231?focusedWorklogId=495441&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495441 ] ASF GitHub Bot logged work on HIVE-24231: - Author: ASF GitHub Bot Created on: 05/Oct/20 16:49 Start Date: 05/Oct/20 16:49 Worklog Time Spent: 10m Work Description: kgyrtkirk opened a new pull request #1553: URL: https://github.com/apache/hive/pull/1553 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495441) Remaining Estimate: 0h Time Spent: 10m > Enhance shared work optimizer to merge scans with semijoin filters on both > sides > > > Key: HIVE-24231 > URL: https://issues.apache.org/jira/browse/HIVE-24231 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24231) Enhance shared work optimizer to merge scans with semijoin filters on both sides
[ https://issues.apache.org/jira/browse/HIVE-24231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24231: -- Labels: pull-request-available (was: ) > Enhance shared work optimizer to merge scans with semijoin filters on both > sides > > > Key: HIVE-24231 > URL: https://issues.apache.org/jira/browse/HIVE-24231 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-24231) Enhance shared work optimizer to merge scans with semijoin filters on both sides
[ https://issues.apache.org/jira/browse/HIVE-24231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Haindrich reassigned HIVE-24231: --- > Enhance shared work optimizer to merge scans with semijoin filters on both > sides > > > Key: HIVE-24231 > URL: https://issues.apache.org/jira/browse/HIVE-24231 > Project: Hive > Issue Type: Improvement >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-23867) Truncate table fail with AccessControlException if doAs enabled and tbl database has source of replication
[ https://issues.apache.org/jira/browse/HIVE-23867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208129#comment-17208129 ]

Anishek Agarwal commented on HIVE-23867: all managed table locations should be owned by hive. I don't think we should support otherwise. cc [~thejas]

> Truncate table fail with AccessControlException if doAs enabled and tbl
> database has source of replication
> --
>
> Key: HIVE-23867
> URL: https://issues.apache.org/jira/browse/HIVE-23867
> Project: Hive
> Issue Type: Bug
> Components: Hive, repl
> Affects Versions: 3.1.1
> Reporter: Rajkumar Singh
> Priority: Major
>
> Steps to repro:
> 1. enable doAs
> 2. with some user (not a super user) create database
> create database sampledb with dbproperties('repl.source.for'='1,2,3');
> 3. create table using create table sampledb.sampletble (id int);
> 4. insert some data into it insert into sampledb.sampletble values (1),
> (2),(3);
> 5. Run truncate command on the table, which fails with the following error
> {code:java}
> org.apache.hadoop.ipc.RemoteException: User username is not a super user
> (non-super user cannot change owner).
> at > org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setOwner(FSDirAttrOp.java:85) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setOwner(FSNamesystem.java:1907) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setOwner(NameNodeRpcServer.java:866) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setOwner(ClientNamenodeProtocolServerSideTranslatorPB.java:531) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498) > ~[hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at org.apache.hadoop.ipc.Client.call(Client.java:1444) > ~[hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at org.apache.hadoop.ipc.Client.call(Client.java:1354) > ~[hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > ~[hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > ~[hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at com.sun.proxy.$Proxy31.setOwner(Unknown Source) ~[?:?] > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setOwner(ClientNamenodeProtocolTranslatorPB.java:470) > ~[hadoop-hdfs-client-3.1.1.3.1.5.0-152.jar:?] 
> at sun.reflect.GeneratedMethodAccessor151.invoke(Unknown Source) ~[?:?] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:1.8.0_232] > at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_232] > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > [hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > ~[hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > ~[hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > [hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > [hadoop-common-3.1.1.3.1.5.0-152.jar:?] > at com.sun.proxy.$Proxy32.setOwner(Unknown Source) [?:?] > at org.apache.hadoop.hdfs.DFSClient.setOwner(DFSClient.java:1914) > [hadoop-hdfs-client-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.hdfs.DistributedFileSystem$36.doCall(DistributedFileSystem.java:1764) > [hadoop-hdfs-client-3.1.1.3.1.5.0-152.jar:?] > at > org.apache.hadoop.hdfs.DistributedFileSystem$36.doCall(DistributedFileSystem.java:1761) > [hadoop-hdfs-client-3.1.1.3.1.5.0-152.jar:
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495383&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495383 ]

ASF GitHub Bot logged work on HIVE-21052:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 15:10
Start Date: 05/Oct/20 15:10
Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1548:
URL: https://github.com/apache/hive/pull/1548#discussion_r499671848

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ##

@@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map<String, String> tblProps) {
     tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES);
   }

+  /**
+   * Look for delta directories matching the list of writeIds and deletes them.
+   * @param rootPartition root partition to look for the delta directories
+   * @param conf configuration
+   * @param writeIds list of writeIds to look for in the delta directories
+   * @return list of deleted directories.
+   * @throws IOException
+   */
+  public static List<FileStatus> deleteDeltaDirectories(Path rootPartition, Configuration conf, Set<Long> writeIds)
+      throws IOException {
+    FileSystem fs = rootPartition.getFileSystem(conf);
+
+    PathFilter filter = (p) -> {
+      String name = p.getName();
+      for (Long wId : writeIds) {
+        if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) {

Review comment: You are right, I got confused, the p entry will solve this.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 495383)
Time Spent: 4.5h (was: 4h 20m)

> Make sure transactions get cleaned if they are aborted before addPartitions
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Jaume M
> Assignee: Jaume M
> Priority: Critical
> Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch,
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch,
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch,
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch,
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
> Time Spent: 4.5h
> Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has
> been written to the table, the transaction manager will think it's an empty
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables.
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and,
> when addPartitions is called, remove this entry from TXN_COMPONENTS and add
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that
> specifies that a transaction was opened and it was aborted, it must generate
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495373&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495373 ]

ASF GitHub Bot logged work on HIVE-24227:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 14:53
Start Date: 05/Oct/20 14:53
Worklog Time Spent: 10m

Work Description: ArkoSharma commented on a change in pull request #1550:
URL: https://github.com/apache/hive/pull/1550#discussion_r499659807

## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/DDLTask.java ##

@@ -82,8 +89,32 @@ public int execute() {
         throw new IllegalArgumentException("Unknown DDL request: " + ddlDesc.getClass());
       }
     } catch (Throwable e) {
+      LOG.error("DDLTask failed", e);
+      int errorCode = ErrorMsg.getErrorMsg(e.getMessage()).getErrorCode();
+      try {
+        ReplicationMetricCollector metricCollector = work.getMetricCollector();
+        if (errorCode > 4) {
+          // in case of replication related task, dumpDirectory should not be null
+          if (work.dumpDirectory != null) {
+            Path nonRecoverableMarker = new Path(work.dumpDirectory, ReplAck.NON_RECOVERABLE_MARKER.toString());
+            org.apache.hadoop.hive.ql.parse.repl.dump.Utils.writeStackTrace(e, nonRecoverableMarker, conf);
+            if (metricCollector != null) {
+              metricCollector.reportStageEnd(getName(), Status.FAILED_ADMIN, nonRecoverableMarker.toString());
+            }
+          }
+          if (metricCollector != null) {

Review comment: In replication flows, dumpDirectory and metricCollector both should be non-null. This line considers the corner case where metricCollector might have been configured but not dumpDirectory. Still it is a replication case since only replication tasks can initialise and pass metricCollector. So we should indicate FAILED_ADMIN state at least (non-recoverable path is null).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495373) Time Spent: 2.5h (was: 2h 20m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495369&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495369 ]

ASF GitHub Bot logged work on HIVE-21052:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 14:49
Start Date: 05/Oct/20 14:49
Worklog Time Spent: 10m

Work Description: vpnvishv commented on a change in pull request #1548:
URL: https://github.com/apache/hive/pull/1548#discussion_r499656397

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ##

@@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map<String, String> tblProps) {
     tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES);
   }

+  /**
+   * Look for delta directories matching the list of writeIds and deletes them.
+   * @param rootPartition root partition to look for the delta directories
+   * @param conf configuration
+   * @param writeIds list of writeIds to look for in the delta directories
+   * @return list of deleted directories.
+   * @throws IOException
+   */
+  public static List<FileStatus> deleteDeltaDirectories(Path rootPartition, Configuration conf, Set<Long> writeIds)
+      throws IOException {
+    FileSystem fs = rootPartition.getFileSystem(conf);
+
+    PathFilter filter = (p) -> {
+      String name = p.getName();
+      for (Long wId : writeIds) {
+        if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) {

Review comment: I was also wondering the same, as this code was there in the earlier patches so I have just kept it. We can remove this.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 495369)
Time Spent: 4h 20m (was: 4h 10m)

> Make sure transactions get cleaned if they are aborted before addPartitions
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Jaume M
> Assignee: Jaume M
> Priority: Critical
> Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch,
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch,
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch,
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch,
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has
> been written to the table, the transaction manager will think it's an empty
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables.
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and,
> when addPartitions is called, remove this entry from TXN_COMPONENTS and add
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that
> specifies that a transaction was opened and it was aborted, it must generate
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495368&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495368 ]

ASF GitHub Bot logged work on HIVE-21052:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 14:47
Start Date: 05/Oct/20 14:47
Worklog Time Spent: 10m

Work Description: vpnvishv commented on a change in pull request #1548:
URL: https://github.com/apache/hive/pull/1548#discussion_r499655306

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ##

@@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map<String, String> tblProps) {
     tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES);
   }

+  /**
+   * Look for delta directories matching the list of writeIds and deletes them.
+   * @param rootPartition root partition to look for the delta directories
+   * @param conf configuration
+   * @param writeIds list of writeIds to look for in the delta directories
+   * @return list of deleted directories.
+   * @throws IOException
+   */
+  public static List<FileStatus> deleteDeltaDirectories(Path rootPartition, Configuration conf, Set<Long> writeIds)
+      throws IOException {
+    FileSystem fs = rootPartition.getFileSystem(conf);
+
+    PathFilter filter = (p) -> {
+      String name = p.getName();
+      for (Long wId : writeIds) {
+        if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) {

Review comment: @pvargacl Sorry I may be missing something here, but with this change, how can compactor read the data of an aborted delta. It should be in the aborted list right, due to this dummy p type entry in TXN_COMPONENTS?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 495368)
Time Spent: 4h 10m (was: 4h)

> Make sure transactions get cleaned if they are aborted before addPartitions
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Jaume M
> Assignee: Jaume M
> Priority: Critical
> Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch,
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch,
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch,
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch,
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has
> been written to the table, the transaction manager will think it's an empty
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables.
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and,
> when addPartitions is called, remove this entry from TXN_COMPONENTS and add
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that
> specifies that a transaction was opened and it was aborted, it must generate
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495367&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495367 ]

ASF GitHub Bot logged work on HIVE-24227:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 14:46
Start Date: 05/Oct/20 14:46
Worklog Time Spent: 10m

Work Description: ArkoSharma commented on a change in pull request #1550:
URL: https://github.com/apache/hive/pull/1550#discussion_r499654760

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/repl/load/message/AlterDatabaseHandler.java ##

@@ -77,9 +79,22 @@
         alterDbDesc = new AlterDatabaseSetOwnerDesc(actualDbName,
             new PrincipalDesc(newDb.getOwnerName(), newDb.getOwnerType()), context.eventOnlyReplicationSpec());
       }
+      Path metricPath = null;
+      ReplicationMetricCollector metricCollector = null;
+      try {
+        metricPath = ReplUtils.getMetricPath(context, context.hiveConf);

Review comment: hiveConf has default access in Context, can't be accessed by ReplUtils.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 495367)
Time Spent: 2h 20m (was: 2h 10m)

> sys.replication_metrics table shows incorrect status for failed policies
> 
>
> Key: HIVE-24227
> URL: https://issues.apache.org/jira/browse/HIVE-24227
> Project: Hive
> Issue Type: Bug
> Reporter: Arko Sharma
> Assignee: Arko Sharma
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24197) Check for write transactions for the db under replication at a frequent interval
[ https://issues.apache.org/jira/browse/HIVE-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aasha Medhi updated HIVE-24197: --- Attachment: HIVE-24197.04.patch Status: Patch Available (was: In Progress) > Check for write transactions for the db under replication at a frequent > interval > > > Key: HIVE-24197 > URL: https://issues.apache.org/jira/browse/HIVE-24197 > Project: Hive > Issue Type: Task >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > Attachments: HIVE-24197.01.patch, HIVE-24197.02.patch, > HIVE-24197.03.patch, HIVE-24197.04.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24197) Check for write transactions for the db under replication at a frequent interval
[ https://issues.apache.org/jira/browse/HIVE-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aasha Medhi updated HIVE-24197: --- Status: In Progress (was: Patch Available) > Check for write transactions for the db under replication at a frequent > interval > > > Key: HIVE-24197 > URL: https://issues.apache.org/jira/browse/HIVE-24197 > Project: Hive > Issue Type: Task >Reporter: Aasha Medhi >Assignee: Aasha Medhi >Priority: Major > Attachments: HIVE-24197.01.patch, HIVE-24197.02.patch, > HIVE-24197.03.patch, HIVE-24197.04.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495366&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495366 ]

ASF GitHub Bot logged work on HIVE-21052:
-
Author: ASF GitHub Bot
Created on: 05/Oct/20 14:40
Start Date: 05/Oct/20 14:40
Worklog Time Spent: 10m

Work Description: vpnvishv commented on a change in pull request #1548:
URL: https://github.com/apache/hive/pull/1548#discussion_r499649919

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java ##

@@ -237,6 +237,7 @@ void run(HiveConf conf, String jobName, Table t, Partition p, StorageDescriptor
     }
     JobConf job = createBaseJobConf(conf, jobName, t, sd, writeIds, ci);
+    QueryCompactor.Util.removeAbortedDirsForAcidTable(conf, dir);

Review comment: @pvargacl You are right, this is not required, as now compactor runs in a transaction and the cleaner has validTxnList with aborted bits set. This we have added wrt Hive-3, in which the cleaner doesn't have aborted bits set, as we create validWriteIdList for the cleaner based on the highestWriteId.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 495366)
Time Spent: 4h (was: 3h 50m)

> Make sure transactions get cleaned if they are aborted before addPartitions
> is called
> -
>
> Key: HIVE-21052
> URL: https://issues.apache.org/jira/browse/HIVE-21052
> Project: Hive
> Issue Type: Bug
> Components: Transactions
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Jaume M
> Assignee: Jaume M
> Priority: Critical
> Labels: pull-request-available
> Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch,
> HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch,
> HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch,
> HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch,
> HIVE-21052.8.patch, HIVE-21052.9.patch
>
> Time Spent: 4h
> Remaining Estimate: 0h
>
> If the transaction is aborted between openTxn and addPartitions and data has
> been written to the table, the transaction manager will think it's an empty
> transaction and no cleaning will be done.
> This is currently an issue in the streaming API and in micromanaged tables.
> As proposed by [~ekoifman] this can be solved by:
> * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and,
> when addPartitions is called, remove this entry from TXN_COMPONENTS and add
> the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that
> specifies that a transaction was opened and it was aborted, it must generate
> jobs for the worker for every possible partition available.
> cc [~ewohlstadter]

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24230) Integrate HPL/SQL into HiveServer2
[ https://issues.apache.org/jira/browse/HIVE-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208111#comment-17208111 ] Attila Magyar commented on HIVE-24230: -- cc: [~kgyrtkirk] > Integrate HPL/SQL into HiveServer2 > -- > > Key: HIVE-24230 > URL: https://issues.apache.org/jira/browse/HIVE-24230 > Project: Hive > Issue Type: Bug > Components: HiveServer2, hpl/sql >Reporter: Attila Magyar >Assignee: Attila Magyar >Priority: Major > > HPL/SQL is a standalone command line program that can store and load scripts > from text files, or from Hive Metastore (since HIVE-24217). Currently HPL/SQL > depends on Hive and not the other way around. > Changing the dependency order between HPL/SQL and HiveServer would open up > some possibilities which are currently not feasible to implement. For example > one might want to use a third party SQL tool to run selects on stored > procedure (or rather function in this case) outputs. > {code:java} > SELECT * from myStoredProcedure(1, 2); {code} > HPL/SQL doesn’t have a JDBC interface and it’s not a daemon so this would not > work with the current architecture. > Another important factor is performance. Declarative SQL commands are sent to > Hive via JDBC by HPL/SQL. The integration would make it possible to drop JDBC > and use HiveServer’s internal API for compilation and execution. > The third factor is that existing tools like Beeline or Hue cannot be used > with HPL/SQL since it has its own, separate CLI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-24230) Integrate HPL/SQL into HiveServer2
[ https://issues.apache.org/jira/browse/HIVE-24230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Magyar reassigned HIVE-24230: > Integrate HPL/SQL into HiveServer2 > -- > > Key: HIVE-24230 > URL: https://issues.apache.org/jira/browse/HIVE-24230 > Project: Hive > Issue Type: Bug > Components: HiveServer2, hpl/sql >Reporter: Attila Magyar >Assignee: Attila Magyar >Priority: Major > > HPL/SQL is a standalone command line program that can store and load scripts > from text files, or from Hive Metastore (since HIVE-24217). Currently HPL/SQL > depends on Hive and not the other way around. > Changing the dependency order between HPL/SQL and HiveServer would open up > some possibilities which are currently not feasible to implement. For example > one might want to use a third party SQL tool to run selects on stored > procedure (or rather function in this case) outputs. > {code:java} > SELECT * from myStoredProcedure(1, 2); {code} > HPL/SQL doesn’t have a JDBC interface and it’s not a daemon so this would not > work with the current architecture. > Another important factor is performance. Declarative SQL commands are sent to > Hive via JDBC by HPL/SQL. The integration would make it possible to drop JDBC > and use HiveServer’s internal API for compilation and execution. > The third factor is that existing tools like Beeline or Hue cannot be used > with HPL/SQL since it has its own, separate CLI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24217) HMS storage backend for HPL/SQL stored procedures
[ https://issues.apache.org/jira/browse/HIVE-24217?focusedWorklogId=495360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495360 ] ASF GitHub Bot logged work on HIVE-24217: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:33 Start Date: 05/Oct/20 14:33 Worklog Time Spent: 10m Work Description: zeroflag commented on a change in pull request #1542: URL: https://github.com/apache/hive/pull/1542#discussion_r499644273 ## File path: standalone-metastore/metastore-server/src/main/resources/package.jdo ## @@ -1549,6 +1549,83 @@ + + + + + + + + + + + + + + + + + + + + + + Review comment: I changed it to CLOB, that is already used at multiple places. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495360) Time Spent: 1.5h (was: 1h 20m) > HMS storage backend for HPL/SQL stored procedures > - > > Key: HIVE-24217 > URL: https://issues.apache.org/jira/browse/HIVE-24217 > Project: Hive > Issue Type: Bug > Components: Hive, hpl/sql, Metastore >Reporter: Attila Magyar >Assignee: Attila Magyar >Priority: Major > Labels: pull-request-available > Attachments: HPL_SQL storedproc HMS storage.pdf > > Time Spent: 1.5h > Remaining Estimate: 0h > > HPL/SQL procedures are currently stored in text files. The goal of this Jira > is to implement a Metastore backend for storing and loading these procedures. > This is an incremental step towards having fully capable stored procedures in > Hive. > > See the attached design for more information. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495358&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495358 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:27 Start Date: 05/Oct/20 14:27 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499639831 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { + return true; +} else if (name.startsWith(baseDir(wId)) && !name.contains("=")) { + return true; +} + } + return false; +}; +List deleted = new ArrayList<>(); +deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted); +return deleted; + } + + private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List deleted) + throws IOException { +RemoteIterator it = listIterator(fs, root, null); + +while (it.hasNext()) { + FileStatus fStatus = it.next(); + if (fStatus.isDirectory()) { +if (filter.accept(fStatus.getPath())) { + fs.delete(fStatus.getPath(), true); + deleted.add(fStatus); +} else { + 
deleteDeltaDirectoriesAux(fStatus.getPath(), fs, filter, deleted); + if (isDirectoryEmpty(fs, fStatus.getPath())) { Review comment: agree, that would simplify re-use of getHdfsDirSnapshots This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495358) Time Spent: 3h 50m (was: 3h 40m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 3h 50m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. 
> cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
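The PathFilter in the diff above keys purely on directory names: it accepts delta/base directories belonging to the given write ids and rejects partition directories. A self-contained sketch of that matching logic follows; the delta_&lt;min&gt;_&lt;max&gt; / base_&lt;writeId&gt; naming and the 7-digit zero-padding mirror Hive's AcidUtils conventions but are reproduced here as assumptions, as is the class name.

```java
import java.util.Set;

public class AbortedDirFilter {
    // Assumed equivalents of AcidUtils.deltaSubdir / baseDir name builders.
    static String deltaSubdir(long min, long max) {
        return String.format("delta_%07d_%07d", min, max);
    }

    static String baseDir(long writeId) {
        return String.format("base_%07d", writeId);
    }

    /** True if the directory name belongs to one of the given write ids
     *  and is not a partition directory (those contain '='). */
    static boolean matches(String name, Set<Long> writeIds) {
        if (name.contains("=")) {
            return false; // partition dirs like ds=2020-10-05 are never deleted here
        }
        for (Long w : writeIds) {
            if (name.startsWith(deltaSubdir(w, w)) || name.startsWith(baseDir(w))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Long> aborted = Set.of(5L);
        System.out.println(matches("delta_0000005_0000005", aborted)); // true
        System.out.println(matches("delta_0000004_0000004", aborted)); // false
        System.out.println(matches("base_0000005", aborted));          // true
        System.out.println(matches("ds=2020-10-05", aborted));         // false
    }
}
```

Filtering on delta_&lt;w&gt;_&lt;w&gt; (min equal to max) means only single-transaction deltas are candidates, which is exactly the shape an aborted write leaves behind.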
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495356&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495356 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:25 Start Date: 05/Oct/20 14:25 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499623864 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { Review comment: Why would it read the aborted data as valid if txn is still in aborted state? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495356) Time Spent: 3h 40m (was: 3.5h) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 3h 40m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495355&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495355 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:25 Start Date: 05/Oct/20 14:25 Worklog Time Spent: 10m Work Description: ArkoSharma commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499638418 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/DirCopyTask.java ## @@ -140,7 +142,23 @@ public int execute() { } }); } catch (Exception e) { - throw new SecurityException(ErrorMsg.REPL_RETRY_EXHAUSTED.format(e.getMessage()), e); Review comment: This check is being done in the following lines. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495355) Time Spent: 2h 10m (was: 2h) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24229) DirectSql fails in case of OracleDB
[ https://issues.apache.org/jira/browse/HIVE-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24229: -- Labels: pull-request-available (was: ) > DirectSql fails in case of OracleDB > --- > > Key: HIVE-24229 > URL: https://issues.apache.org/jira/browse/HIVE-24229 > Project: Hive > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Direct Sql fails due to different data type mapping in case of Oracle DB -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24229) DirectSql fails in case of OracleDB
[ https://issues.apache.org/jira/browse/HIVE-24229?focusedWorklogId=495351&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495351 ] ASF GitHub Bot logged work on HIVE-24229: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:15 Start Date: 05/Oct/20 14:15 Worklog Time Spent: 10m Work Description: ayushtkn opened a new pull request #1552: URL: https://github.com/apache/hive/pull/1552 https://issues.apache.org/jira/browse/HIVE-24229 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495351) Remaining Estimate: 0h Time Spent: 10m > DirectSql fails in case of OracleDB > --- > > Key: HIVE-24229 > URL: https://issues.apache.org/jira/browse/HIVE-24229 > Project: Hive > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Time Spent: 10m > Remaining Estimate: 0h > > Direct Sql fails due to different data type mapping in case of Oracle DB -- This message was sent by Atlassian Jira (v8.3.4#803005)
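The HIVE-24229 description doesn't name the exact mismatch, but a classic cause in backend-portable direct SQL is numeric mapping: Oracle's JDBC driver returns NUMBER columns from getObject() as java.math.BigDecimal, while other databases hand back Long or Integer. A hedged sketch of the defensive coercion such code needs follows; the class and method names are illustrative, not necessarily Hive's actual fix.

```java
import java.math.BigDecimal;

public class DirectSqlCoercion {
    /** Coerce a column value obtained via getObject() to long, regardless of
     *  which Number subtype the JDBC driver chose (BigDecimal on Oracle,
     *  Long/Integer on most other backends). */
    static long extractSqlLong(Object dbValue) {
        if (dbValue instanceof Number) {
            return ((Number) dbValue).longValue();
        }
        throw new IllegalArgumentException("Expected a numeric column, got: " + dbValue);
    }

    public static void main(String[] args) {
        System.out.println(extractSqlLong(new BigDecimal("42"))); // Oracle-style value
        System.out.println(extractSqlLong(7L));                   // Derby/MySQL-style value
    }
}
```

Coercing through the Number supertype rather than casting to a concrete class keeps one query path working across all supported metastore databases.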
[jira] [Work logged] (HIVE-22826) ALTER TABLE RENAME COLUMN doesn't update list of bucketed column names
[ https://issues.apache.org/jira/browse/HIVE-22826?focusedWorklogId=495349&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495349 ] ASF GitHub Bot logged work on HIVE-22826: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:14 Start Date: 05/Oct/20 14:14 Worklog Time Spent: 10m Work Description: ashish-kumar-sharma commented on a change in pull request #1528: URL: https://github.com/apache/hive/pull/1528#discussion_r498902147 ## File path: ql/src/test/queries/clientpositive/alter_numbuckets_partitioned_table_h23.q ## @@ -52,6 +52,12 @@ alter table tst1_n1 clustered by (value) into 12 buckets; describe formatted tst1_n1; +-- Test changing name of bucket column + +alter table tst1_n1 change key keys string; + +describe formatted tst1_n1; Review comment: After adding show create table, the test started failing because the result expects "### masked information ", which led to multiple test failures. The information shown by show create table is the same as describe table, since we only change the column name. Hence, after multiple retries, I decided to remove it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495349) Time Spent: 2h 50m (was: 2h 40m) > ALTER TABLE RENAME COLUMN doesn't update list of bucketed column names > --- > > Key: HIVE-22826 > URL: https://issues.apache.org/jira/browse/HIVE-22826 > Project: Hive > Issue Type: Bug > Components: Query Planning >Affects Versions: 4.0.0 >Reporter: Karen Coppage >Assignee: Ashish Sharma >Priority: Major > Labels: pull-request-available > Attachments: unitTest.patch > > Time Spent: 2h 50m > Remaining Estimate: 0h > > Compaction for tables where a bucketed column has been renamed fails since > the list of bucketed columns in the StorageDescriptor doesn't get updated > when the column is renamed, therefore we can't recreate the table correctly > during compaction. > Attached a unit test that fails. > NO PRECOMMIT TESTS -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495341&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495341 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:05 Start Date: 05/Oct/20 14:05 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499623864 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { Review comment: Why would it read the aborted data as valid if txn is in aborted state? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495341) Time Spent: 3.5h (was: 3h 20m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 3.5h > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495338&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495338 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 14:00 Start Date: 05/Oct/20 14:00 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499620518 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { + return true; +} else if (name.startsWith(baseDir(wId)) && !name.contains("=")) { + return true; +} + } + return false; +}; +List deleted = new ArrayList<>(); +deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted); +return deleted; + } + + private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List deleted) Review comment: getHdfsDirSnapshots does the same recursive listing, isn't it? ``` RemoteIterator itr = fs.listFiles(path, true); while (itr.hasNext()) { ``` This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495338) Time Spent: 3h 20m (was: 3h 10m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 3h 20m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-24229) DirectSql fails in case of OracleDB
[ https://issues.apache.org/jira/browse/HIVE-24229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena reassigned HIVE-24229: --- > DirectSql fails in case of OracleDB > --- > > Key: HIVE-24229 > URL: https://issues.apache.org/jira/browse/HIVE-24229 > Project: Hive > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > > Direct Sql fails due to different data type mapping in case of Oracle DB -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495288&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495288 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:59 Start Date: 05/Oct/20 11:59 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499545127 ## File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnCommands2.java ## @@ -2128,6 +2128,395 @@ public void testCleanerForTxnToWriteId() throws Exception { 0, TxnDbUtil.countQueryAgent(hiveConf, "select count(*) from TXN_TO_WRITE_ID")); } + @Test +public void testMmTableAbortWithCompaction() throws Exception { +// 1. Insert some rows into MM table +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(1,2)"); +// There should be 1 delta directory +int [][] resultData1 = new int[][] {{1,2}}; +verifyDeltaDirAndResult(1, Table.MMTBL.toString(), "", resultData1); +List r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("1", r1.get(0)); + +// 2. Let a transaction be aborted +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, true); +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(3,4)"); +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, false); +// There should be 1 delta and 1 base directory. The base one is the aborted one. +verifyDeltaDirAndResult(2, Table.MMTBL.toString(), "", resultData1); + +r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("1", r1.get(0)); + +// Verify query result +int [][] resultData2 = new int[][] {{1,2}, {5,6}}; + +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(5,6)"); +verifyDeltaDirAndResult(3, Table.MMTBL.toString(), "", resultData2); +r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("2", r1.get(0)); + +// 4. 
Perform a MINOR compaction, expectation is it should remove aborted base dir +runStatementOnDriver("alter table "+ Table.MMTBL + " compact 'MINOR'"); +// The worker should remove the subdir for aborted transaction +runWorker(hiveConf); +verifyDeltaDirAndResult(2, Table.MMTBL.toString(), "", resultData2); +verifyBaseDirAndResult(0, Table.MMTBL.toString(), "", resultData2); +// 5. Run Cleaner. Shouldn't impact anything. +runCleaner(hiveConf); +// 6. Run initiator remove aborted entry from TXNS table +runInitiator(hiveConf); + +// Verify query result +List rs = runStatementOnDriver("select a,b from " + Table.MMTBL + " order by a"); +Assert.assertEquals(stringifyValues(resultData2), rs); + +int [][] resultData3 = new int[][] {{1,2}, {5,6}, {7,8}}; +// 7. add few more rows +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(7,8)"); +// 8. add one more aborted delta +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, true); +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(9,10)"); +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, false); + +// 9. 
Perform a MAJOR compaction, expectation is it should remove aborted base dir +runStatementOnDriver("alter table "+ Table.MMTBL + " compact 'MAJOR'"); +verifyDeltaDirAndResult(4, Table.MMTBL.toString(), "", resultData3); +runWorker(hiveConf); +verifyDeltaDirAndResult(3, Table.MMTBL.toString(), "", resultData3); +verifyBaseDirAndResult(1, Table.MMTBL.toString(), "", resultData3); +runCleaner(hiveConf); +verifyDeltaDirAndResult(0, Table.MMTBL.toString(), "", resultData3); +verifyBaseDirAndResult(1, Table.MMTBL.toString(), "", resultData3); +runInitiator(hiveConf); +verifyDeltaDirAndResult(0, Table.MMTBL.toString(), "", resultData3); +verifyBaseDirAndResult(1, Table.MMTBL.toString(), "", resultData3); + +// Verify query result +rs = runStatementOnDriver("select a,b from " + Table.MMTBL + " order by a"); +Assert.assertEquals(stringifyValues(resultData3), rs); + } + @Test + public void testMmTableAbortWithCompactionNoCleanup() throws Exception { +// 1. Insert some rows into MM table +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(1,2)"); +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(5,6)"); +// There should be 1 delta directory +int [][] resultData1 = new int[][] {{1,2}, {5,6}}; +verifyDeltaDirAndResult(2, Table.MMTBL.toString(), "", resultData1); +List r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("2", r1.get(0)); + +// 2. Let a transaction be aborted +hiveConf.set
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495289&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495289 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:59 Start Date: 05/Oct/20 11:59 Worklog Time Spent: 10m Work Description: deniskuzZ commented on pull request #1548: URL: https://github.com/apache/hive/pull/1548#issuecomment-703585092 > @deniskuzZ Overall change LGTM. > Looked into the test failures, one of test requires change in expected values wrt master branch. Other two looks genuine failures to me. Please check the inline comments. @vpnvishv, thank you for the review! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495289) Time Spent: 3h 10m (was: 3h) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 3h 10m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
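The marker scheme proposed in the issue description can be sketched with an in-memory stand-in for the metastore's TXN_COMPONENTS table (class, method, and marker names here are illustrative, not Hive's actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the special-marker lifecycle: openTxn writes a marker entry so an
// abort between openTxn and addPartitions is never invisible to the cleaner.
public class TxnComponentsSketch {
    static final String MARKER = "_MARKER_"; // illustrative marker value
    final Map<Long, String> txnComponents = new HashMap<>();

    void openTxn(long txnId) {
        // Write the marker entry at open time.
        txnComponents.put(txnId, MARKER);
    }

    void addPartitions(long txnId, String partition) {
        // Replace the marker with the real partition entry.
        txnComponents.put(txnId, partition);
    }

    // Cleaner side: a txn that aborted while still holding the marker needs
    // cleanup jobs generated for every possible partition.
    boolean needsFullPartitionScan(long txnId) {
        return MARKER.equals(txnComponents.get(txnId));
    }
}
```

A transaction aborted before addPartitions still holds the marker, so the cleaner can no longer mistake it for an empty transaction.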
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495287&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495287 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:58 Start Date: 05/Oct/20 11:58 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499544579 ## File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnCommands2.java ## @@ -2128,6 +2128,395 @@ public void testCleanerForTxnToWriteId() throws Exception { 0, TxnDbUtil.countQueryAgent(hiveConf, "select count(*) from TXN_TO_WRITE_ID")); } + @Test +public void testMmTableAbortWithCompaction() throws Exception { +// 1. Insert some rows into MM table +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(1,2)"); +// There should be 1 delta directory +int [][] resultData1 = new int[][] {{1,2}}; +verifyDeltaDirAndResult(1, Table.MMTBL.toString(), "", resultData1); +List r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("1", r1.get(0)); + +// 2. Let a transaction be aborted +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, true); +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(3,4)"); +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, false); +// There should be 1 delta and 1 base directory. The base one is the aborted one. 
+verifyDeltaDirAndResult(2, Table.MMTBL.toString(), "", resultData1); + +r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("1", r1.get(0)); + +// Verify query result +int [][] resultData2 = new int[][] {{1,2}, {5,6}}; + +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(5,6)"); +verifyDeltaDirAndResult(3, Table.MMTBL.toString(), "", resultData2); +r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("2", r1.get(0)); Review comment: fixed, turned off StatsOptimizer ## File path: ql/src/test/org/apache/hadoop/hive/ql/TestTxnCommands2.java ## @@ -2128,6 +2128,395 @@ public void testCleanerForTxnToWriteId() throws Exception { 0, TxnDbUtil.countQueryAgent(hiveConf, "select count(*) from TXN_TO_WRITE_ID")); } + @Test +public void testMmTableAbortWithCompaction() throws Exception { +// 1. Insert some rows into MM table +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(1,2)"); +// There should be 1 delta directory +int [][] resultData1 = new int[][] {{1,2}}; +verifyDeltaDirAndResult(1, Table.MMTBL.toString(), "", resultData1); +List r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("1", r1.get(0)); + +// 2. Let a transaction be aborted +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, true); +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(3,4)"); +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, false); +// There should be 1 delta and 1 base directory. The base one is the aborted one. 
+verifyDeltaDirAndResult(2, Table.MMTBL.toString(), "", resultData1); + +r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("1", r1.get(0)); + +// Verify query result +int [][] resultData2 = new int[][] {{1,2}, {5,6}}; + +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(5,6)"); +verifyDeltaDirAndResult(3, Table.MMTBL.toString(), "", resultData2); +r1 = runStatementOnDriver("select count(*) from " + Table.MMTBL); +Assert.assertEquals("2", r1.get(0)); + +// 4. Perform a MINOR compaction, expectation is it should remove aborted base dir +runStatementOnDriver("alter table "+ Table.MMTBL + " compact 'MINOR'"); +// The worker should remove the subdir for aborted transaction +runWorker(hiveConf); +verifyDeltaDirAndResult(2, Table.MMTBL.toString(), "", resultData2); +verifyBaseDirAndResult(0, Table.MMTBL.toString(), "", resultData2); +// 5. Run Cleaner. Shouldn't impact anything. +runCleaner(hiveConf); +// 6. Run initiator remove aborted entry from TXNS table +runInitiator(hiveConf); + +// Verify query result +List rs = runStatementOnDriver("select a,b from " + Table.MMTBL + " order by a"); +Assert.assertEquals(stringifyValues(resultData2), rs); + +int [][] resultData3 = new int[][] {{1,2}, {5,6}, {7,8}}; +// 7. add few more rows +runStatementOnDriver("insert into " + Table.MMTBL + "(a,b) values(7,8)"); +// 8. add one more aborted delta +hiveConf.setBoolVar(HiveConf.ConfVars.HIVETESTMODEROLLBACKTXN, true); +runSt
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495283&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495283 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:50 Start Date: 05/Oct/20 11:50 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499539993 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -280,6 +299,49 @@ public static PathFilter getBootstrapDirectoryFilter(final FileSystem fs) { }; } + public static Path getMetricPath(MessageHandler.Context context, HiveConf hiveConf) throws Exception{ +DumpType dumpType; +Path metricPath = null; +String dumpMetaFile = DumpMetaData.getDmdFileName(); +FileSystem fs = null; +if(context.dmd != null) { + dumpType = context.dmd.getDumpType(); + fs = context.dmd.getDumpFilePath().getFileSystem(hiveConf); + metricPath = context.dmd.getDumpFilePath().getParent(); +} +else { + dumpType = null; + if(context.location != null){ +metricPath = (new Path(context.location)).getParent(); +fs = (new Path(context.location)).getFileSystem(hiveConf); + } +} +//traverse to hiveDumpRoot required by metric-collector +while (metricPath != null && fs != null && dumpType != DumpType.BOOTSTRAP && dumpType != DumpType.INCREMENTAL) { + metricPath = metricPath.getParent(); + if (fs.exists(new Path(metricPath, dumpMetaFile))) { +dumpType = (new DumpMetaData(metricPath, hiveConf)).getDumpType(); + } +} +return metricPath; + } + + public static ReplicationMetricCollector getMetricCollector(MessageHandler.Context context, String dbName, + Path metricPath, HiveConf hiveConf) throws Exception { +if (metricPath != null) { + DumpType dumpType = (new DumpMetaData(metricPath, hiveConf)).getDumpType(); + //for using this, dumpType should be either INCREMENTAL or BOOTSTRAP. 
+ if (dumpType == DumpType.BOOTSTRAP) { +return new BootstrapLoadMetricCollector(dbName, metricPath.toString(), +context.dmd.getDumpExecutionId(), hiveConf); Review comment: you can pass just the DumpExecutionId. no need to pass the entire context This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495283) Time Spent: 1h 50m (was: 1h 40m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495284&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495284 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:50 Start Date: 05/Oct/20 11:50 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499540414 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -280,6 +299,49 @@ public static PathFilter getBootstrapDirectoryFilter(final FileSystem fs) { }; } + public static Path getMetricPath(MessageHandler.Context context, HiveConf hiveConf) throws Exception{ +DumpType dumpType; +Path metricPath = null; +String dumpMetaFile = DumpMetaData.getDmdFileName(); +FileSystem fs = null; +if(context.dmd != null) { + dumpType = context.dmd.getDumpType(); + fs = context.dmd.getDumpFilePath().getFileSystem(hiveConf); + metricPath = context.dmd.getDumpFilePath().getParent(); +} +else { + dumpType = null; + if(context.location != null){ +metricPath = (new Path(context.location)).getParent(); +fs = (new Path(context.location)).getFileSystem(hiveConf); + } +} +//traverse to hiveDumpRoot required by metric-collector +while (metricPath != null && fs != null && dumpType != DumpType.BOOTSTRAP && dumpType != DumpType.INCREMENTAL) { Review comment: this may be error prone. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495284) Time Spent: 2h (was: 1h 50m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
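The reviewer flags the unbounded parent-directory walk in getMetricPath as error prone. One way to harden it, sketched here with hypothetical names and plain strings standing in for Hadoop Path objects, is to cap the number of getParent() hops:

```java
import java.util.function.Predicate;

// Sketch of a depth-bounded variant of the dump-root search: walk upward at
// most maxHops levels so a missing dump-metadata file cannot cause an
// unbounded loop up to (and past) the filesystem root.
public class BoundedWalkSketch {
    static String findDumpRoot(String path, Predicate<String> hasDumpMetadata, int maxHops) {
        String current = path;
        for (int hop = 0; current != null && hop < maxHops; hop++) {
            if (hasDumpMetadata.test(current)) {
                return current; // found the hiveDumpRoot
            }
            int slash = current.lastIndexOf('/');
            current = slash > 0 ? current.substring(0, slash) : null; // parent, or stop at root
        }
        return null; // not found within the bound
    }
}
```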
[jira] [Work logged] (HIVE-20137) Truncate for Transactional tables should use base_x
[ https://issues.apache.org/jira/browse/HIVE-20137?focusedWorklogId=495282&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495282 ] ASF GitHub Bot logged work on HIVE-20137: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:49 Start Date: 05/Oct/20 11:49 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1532: URL: https://github.com/apache/hive/pull/1532#discussion_r499536263 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2999,6 +2980,10 @@ Seems much cleaner if each stmt is identified as a particular HiveOperation (whi compBuilder.setExclusive(); compBuilder.setOperationType(DataOperationType.NO_TXN); break; + case DDL_EXCL_WRITE: +compBuilder.setExclWrite(); Review comment: ExclWrite is gonna block concurrent reads. Is it expected? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495282) Time Spent: 1h (was: 50m) > Truncate for Transactional tables should use base_x > --- > > Key: HIVE-20137 > URL: https://issues.apache.org/jira/browse/HIVE-20137 > Project: Hive > Issue Type: Improvement > Components: Transactions >Affects Versions: 3.0.0 >Reporter: Eugene Koifman >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > This is a follow up to HIVE-19387. > Once we have a lock that blocks writers but not readers (HIVE-19369), it > would make sense to make truncate create a new base_x, where is x is a > writeId in current txn - the same as Insert Overwrite does. > This would mean it can work w/o interfering with existing writers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
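The reason a truncate that writes base_x (x being a writeId in the current txn) does not interfere with existing readers can be sketched from the directory-selection rule: readers take the highest base and only the deltas above it. The parsing below is a simplification; the real logic lives in AcidUtils:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: given ACID directory names, pick the highest base_N and the deltas
// whose write id is above it. Older deltas are simply ignored, so a new base
// written by truncate supersedes prior data without blocking readers.
public class BaseXSketch {
    static List<String> visibleDirs(List<String> dirs) {
        long bestBase = -1;
        for (String d : dirs) {
            if (d.startsWith("base_")) {
                bestBase = Math.max(bestBase, Long.parseLong(d.substring(5)));
            }
        }
        List<String> visible = new ArrayList<>();
        if (bestBase >= 0) {
            visible.add("base_" + bestBase);
        }
        for (String d : dirs) {
            if (d.startsWith("delta_")) {
                // delta_x_y: keep only deltas starting above the chosen base
                long minWriteId = Long.parseLong(d.substring(6, d.indexOf('_', 6)));
                if (minWriteId > bestBase) {
                    visible.add(d);
                }
            }
        }
        return visible;
    }
}
```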
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495281&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495281 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:41 Start Date: 05/Oct/20 11:41 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499535302 ## File path: ql/src/java/org/apache/hadoop/hive/ql/plan/ReplTxnWork.java ## @@ -92,6 +120,18 @@ public ReplTxnWork(String dbName, String tableName, List partNames, this.operation = type; } + public ReplTxnWork(String dbName, String tableName, List partNames, Review comment: have 2 constructors. one with the dumpDirectory and metricCollector and one without. That way you don't need to change existing code This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495281) Time Spent: 1h 40m (was: 1.5h) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
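The reviewer's suggestion of two constructors, one with the dumpDirectory and metricCollector and one without, can be sketched as below. The class mirrors ReplTxnWork but is simplified, and the field types are stand-ins:

```java
import java.util.Collections;
import java.util.List;

// Sketch: keep the existing constructor for non-replication callers and add
// an overload carrying the replication-specific fields, so existing call
// sites need no change.
public class ReplTxnWorkSketch {
    final String dbName;
    final String tableName;
    final List<String> partNames;
    final String dumpDirectory;   // null outside replication flows
    final Object metricCollector; // stand-in for ReplicationMetricCollector

    // Existing constructor: unchanged call sites keep using this.
    public ReplTxnWorkSketch(String dbName, String tableName, List<String> partNames) {
        this(dbName, tableName, partNames, null, null);
    }

    // New overload: replication callers supply the extra fields explicitly.
    public ReplTxnWorkSketch(String dbName, String tableName, List<String> partNames,
                             String dumpDirectory, Object metricCollector) {
        this.dbName = dbName;
        this.tableName = tableName;
        this.partNames = partNames;
        this.dumpDirectory = dumpDirectory;
        this.metricCollector = metricCollector;
    }
}
```

Delegating through `this(...)` keeps the null-handling in one place instead of at every call site.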
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495280&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495280 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:38 Start Date: 05/Oct/20 11:38 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499533749 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/repl/load/message/AlterDatabaseHandler.java ## @@ -77,9 +79,22 @@ alterDbDesc = new AlterDatabaseSetOwnerDesc(actualDbName, new PrincipalDesc(newDb.getOwnerName(), newDb.getOwnerType()), context.eventOnlyReplicationSpec()); } + Path metricPath = null; + ReplicationMetricCollector metricCollector = null; + try{ +metricPath = ReplUtils.getMetricPath(context, context.hiveConf); Review comment: you are passing the context already. hiveconf is part of that This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495280) Time Spent: 1.5h (was: 1h 20m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495277&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495277 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:30 Start Date: 05/Oct/20 11:30 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499529820 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsUpdateTask.java ## @@ -355,8 +360,32 @@ public int execute() { } catch (Exception e) { setException(e); LOG.info("Failed to persist stats in metastore", e); + int errorCode = ErrorMsg.getErrorMsg(e.getMessage()).getErrorCode(); + try { +ReplicationMetricCollector metricCollector = work.getMetricCollector(); +if (errorCode > 4) { Review comment: Same applies to all the tasks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495277) Time Spent: 1h 20m (was: 1h 10m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495276 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:30 Start Date: 05/Oct/20 11:30 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499529679 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -242,14 +250,25 @@ public static String getNonEmpty(String configParam, HiveConf hiveConf, String e return taskList; } + public static List> addTasksForLoadingColStats(ColumnStatistics colStats, + HiveConf conf, + UpdatedMetaDataTracker updatedMetadata, + org.apache.hadoop.hive.metastore.api.Table tableObj, + long writeId) throws IOException, TException{ +return addTasksForLoadingColStats(colStats, conf, updatedMetadata, tableObj, Review comment: Same applies to other places as well This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495276) Time Spent: 1h 10m (was: 1h) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495272&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495272 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:25 Start Date: 05/Oct/20 11:25 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499526784 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/util/ReplUtils.java ## @@ -242,14 +250,25 @@ public static String getNonEmpty(String configParam, HiveConf hiveConf, String e return taskList; } + public static List> addTasksForLoadingColStats(ColumnStatistics colStats, + HiveConf conf, + UpdatedMetaDataTracker updatedMetadata, + org.apache.hadoop.hive.metastore.api.Table tableObj, + long writeId) throws IOException, TException{ +return addTasksForLoadingColStats(colStats, conf, updatedMetadata, tableObj, Review comment: create a overloaded method. Needn't pass null This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495272) Time Spent: 1h (was: 50m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495270&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495270 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:17 Start Date: 05/Oct/20 11:17 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499522906 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/repl/DirCopyTask.java ## @@ -140,7 +142,23 @@ public int execute() { } }); } catch (Exception e) { - throw new SecurityException(ErrorMsg.REPL_RETRY_EXHAUSTED.format(e.getMessage()), e); Review comment: need to check why this task was throwing the exception initially This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495270) Time Spent: 50m (was: 40m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495268&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495268 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:15 Start Date: 05/Oct/20 11:15 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499521480 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CopyTask.java ## @@ -103,7 +108,32 @@ protected int copyOnePath(Path fromPath, Path toPath) { } catch (Exception e) { console.printError("Failed with exception " + e.getMessage(), "\n" + StringUtils.stringifyException(e)); - return (1); + LOG.error("CopyTask failed", e); Review comment: exception is not set at the task level This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495268) Time Spent: 40m (was: 0.5h) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495267&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495267 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:14 Start Date: 05/Oct/20 11:14 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499521480 ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/CopyTask.java ## @@ -103,7 +108,32 @@ protected int copyOnePath(Path fromPath, Path toPath) { } catch (Exception e) { console.printError("Failed with exception " + e.getMessage(), "\n" + StringUtils.stringifyException(e)); - return (1); + LOG.error("CopyTask failed", e); Review comment: exception is not set This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495267) Time Spent: 0.5h (was: 20m) > sys.replication_metrics table shows incorrect status for failed policies > > > Key: HIVE-24227 > URL: https://issues.apache.org/jira/browse/HIVE-24227 > Project: Hive > Issue Type: Bug >Reporter: Arko Sharma >Assignee: Arko Sharma >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24227) sys.replication_metrics table shows incorrect status for failed policies
[ https://issues.apache.org/jira/browse/HIVE-24227?focusedWorklogId=495266&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495266 ] ASF GitHub Bot logged work on HIVE-24227: - Author: ASF GitHub Bot Created on: 05/Oct/20 11:13 Start Date: 05/Oct/20 11:13 Worklog Time Spent: 10m Work Description: aasha commented on a change in pull request #1550: URL: https://github.com/apache/hive/pull/1550#discussion_r499519404 ## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/DDLTask.java ## @@ -82,8 +89,32 @@ public int execute() { throw new IllegalArgumentException("Unknown DDL request: " + ddlDesc.getClass()); } } catch (Throwable e) { + LOG.error("DDLTask failed", e); + int errorCode = ErrorMsg.getErrorMsg(e.getMessage()).getErrorCode(); + try { +ReplicationMetricCollector metricCollector = work.getMetricCollector(); +if (errorCode > 4) { + //in case of replication related task, dumpDirectory should not be null + if(work.dumpDirectory != null) { +Path nonRecoverableMarker = new Path(work.dumpDirectory, ReplAck.NON_RECOVERABLE_MARKER.toString()); +org.apache.hadoop.hive.ql.parse.repl.dump.Utils.writeStackTrace(e, nonRecoverableMarker, conf); +if(metricCollector != null){ + metricCollector.reportStageEnd(getName(), Status.FAILED_ADMIN, nonRecoverableMarker.toString()); +} + } + if(metricCollector != null){ Review comment: this is needed only in replication case ## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/DDLTask.java ## @@ -82,8 +89,32 @@ public int execute() { throw new IllegalArgumentException("Unknown DDL request: " + ddlDesc.getClass()); } } catch (Throwable e) { + LOG.error("DDLTask failed", e); + int errorCode = ErrorMsg.getErrorMsg(e.getMessage()).getErrorCode(); + try { +ReplicationMetricCollector metricCollector = work.getMetricCollector(); +if (errorCode > 4) { + //in case of replication related task, dumpDirectory should not be null + if(work.dumpDirectory != null) { +Path nonRecoverableMarker = new 
Path(work.dumpDirectory, ReplAck.NON_RECOVERABLE_MARKER.toString()); +org.apache.hadoop.hive.ql.parse.repl.dump.Utils.writeStackTrace(e, nonRecoverableMarker, conf); +if(metricCollector != null){ + metricCollector.reportStageEnd(getName(), Status.FAILED_ADMIN, nonRecoverableMarker.toString()); +} + } + if(metricCollector != null){ +metricCollector.reportStageEnd(getName(), Status.FAILED_ADMIN, null); + } +} else { + if(metricCollector != null){ +work.getMetricCollector().reportStageEnd(getName(), Status.FAILED); Review comment: use metricCollector directly ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsUpdateTask.java ## @@ -355,8 +360,32 @@ public int execute() { } catch (Exception e) { setException(e); LOG.info("Failed to persist stats in metastore", e); + int errorCode = ErrorMsg.getErrorMsg(e.getMessage()).getErrorCode(); + try { +ReplicationMetricCollector metricCollector = work.getMetricCollector(); +if (errorCode > 4) { Review comment: All this code can be part of a util method. ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/MoveTask.java ## @@ -464,14 +468,63 @@ public int execute() { console.printInfo("\n", StringUtils.stringifyException(he),false); } } - setException(he); + LOG.error("MoveTask failed", he); + errorCode = ErrorMsg.getErrorMsg(he.getMessage()).getErrorCode(); + try { +ReplicationMetricCollector metricCollector = work.getMetricCollector(); Review comment: util method ## File path: ql/src/java/org/apache/hadoop/hive/ql/ddl/DDLTask.java ## @@ -82,8 +89,32 @@ public int execute() { throw new IllegalArgumentException("Unknown DDL request: " + ddlDesc.getClass()); } } catch (Throwable e) { + LOG.error("DDLTask failed", e); Review comment: print the DDL operation too. 
DDL task can be called for different operation ## File path: ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsUpdateTask.java ## @@ -355,8 +360,32 @@ public int execute() { } catch (Exception e) { setException(e); LOG.info("Failed to persist stats in metastore", e); + int errorCode = ErrorMsg.getErrorMsg(e.getMessage()).getErrorCode(); + try { +ReplicationMetricCollector metricCollector = work.getMetricCollector(); +if (errorCode > 4) { + //in case of repl
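The util method the reviewer asks for, consolidating the catch-block logic repeated across DDLTask, ColumnStatsUpdateTask, and MoveTask, might look like the sketch below. The interface is a minimal stand-in for ReplicationMetricCollector, the status strings are illustrative, and the `errorCode > 4` threshold simply mirrors the snippets quoted above:

```java
// Sketch of a shared failure-reporting helper: decide from the error code
// whether a replication failure is non-recoverable (FAILED_ADMIN plus a
// marker file under the dump directory) or retryable (FAILED), and report it.
public class ReplFailureUtilSketch {
    interface MetricCollector {
        void reportStageEnd(String stage, String status, String marker);
    }

    static String reportFailure(String stage, int errorCode, String dumpDirectory,
                                MetricCollector collector) {
        String marker = null;
        String status;
        if (errorCode > 4) {
            status = "FAILED_ADMIN";
            if (dumpDirectory != null) {
                // In the real code the stack trace is written to this marker path.
                marker = dumpDirectory + "/_non_recoverable";
            }
        } else {
            status = "FAILED";
        }
        if (collector != null) { // null outside replication flows
            collector.reportStageEnd(stage, status, marker);
        }
        return status;
    }
}
```

Each task's catch block would then reduce to a single call, which also addresses the "use metricCollector directly" comment.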
[jira] [Commented] (HIVE-24205) Optimise CuckooSetBytes
[ https://issues.apache.org/jira/browse/HIVE-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207989#comment-17207989 ] Rajesh Balamohan commented on HIVE-24205: - Thanks [~mustafaiman]. With repeated runs (i.e without any data miss), I see around 9-10% improvement with the PR. This is based on a small 5 node LLAP cluster with TPCH12 (43.82 seconds vs 39.01 seconds). Tried with "select count(*) from lineitem where l_shipmode in ('REG AIR', 'MAIL');" which showed an even bigger improvement between the runs without and with the PR (10.94 seconds vs 8.49 seconds). > Optimise CuckooSetBytes > --- > > Key: HIVE-24205 > URL: https://issues.apache.org/jira/browse/HIVE-24205 > Project: Hive > Issue Type: Improvement >Reporter: Rajesh Balamohan >Assignee: Mustafa Iman >Priority: Major > Labels: pull-request-available > Attachments: Screenshot 2020-09-28 at 4.29.24 PM.png, bench.png, > vectorized.patch > > Time Spent: 10m > Remaining Estimate: 0h > > {{FilterStringColumnInList, StringColumnInList}} etc use CuckooSetBytes for > lookup. > !Screenshot 2020-09-28 at 4.29.24 PM.png|width=714,height=508! > One option to optimize would be to add boundary conditions on "length" with > the min/max length stored in the hashes (ref: > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/CuckooSetBytes.java#L85]) > . This would significantly reduce the number of hash computation that needs > to happen. E.g > [TPCH-Q12|https://github.com/hortonworks/hive-testbench/blob/hdp3/sample-queries-tpch/tpch_query12.sql#L20] -- This message was sent by Atlassian Jira (v8.3.4#803005)
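The length-boundary optimisation described in the issue can be sketched as follows. A plain HashSet of strings stands in for the byte-array cuckoo tables of CuckooSetBytes; the point is only the min/max length fast path that avoids hash probes entirely:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: track min/max key length at insert time and reject lookups outside
// that range before doing any hash work. For an IN-list like
// ('REG AIR', 'MAIL'), most non-matching values fail the length check alone.
public class LengthBoundedSetSketch {
    private final Set<String> set = new HashSet<>();
    private int minLen = Integer.MAX_VALUE;
    private int maxLen = -1;

    void insert(String key) {
        minLen = Math.min(minLen, key.length());
        maxLen = Math.max(maxLen, key.length());
        set.add(key);
    }

    boolean lookup(String key) {
        // Boundary fast path: keys outside [minLen, maxLen] cannot be present.
        if (key.length() < minLen || key.length() > maxLen) {
            return false;
        }
        return set.contains(key);
    }
}
```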
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495235&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495235 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 09:49 Start Date: 05/Oct/20 09:49 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499475796 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/CompactionTxnHandler.java ## @@ -107,11 +107,12 @@ public CompactionTxnHandler() { // Check for aborted txns: number of aborted txns past threshold and age of aborted txns // past time threshold boolean checkAbortedTimeThreshold = abortedTimeThreshold >= 0; -final String sCheckAborted = "SELECT \"TC_DATABASE\", \"TC_TABLE\", \"TC_PARTITION\"," -+ "MIN(\"TXN_STARTED\"), COUNT(*)" +String sCheckAborted = "SELECT \"TC_DATABASE\", \"TC_TABLE\", \"TC_PARTITION\", " ++ "MIN(\"TXN_STARTED\"), COUNT(*), " ++ "MAX(CASE WHEN \"TC_OPERATION_TYPE\" = " + OperationType.DYNPART + " THEN 1 ELSE 0 END) AS \"IS_DP\" " Review comment: I might be mistaken here, but does this mean, that if we have many "normal" aborted txn and 1 aborted dynpart txn, we will not initiate a normal compaction until the dynpart stuff is not cleaned up? Is this ok, shouldn't we doing both? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495235) Time Spent: 2h 40m (was: 2.5h) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 2h 40m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
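The marker scheme proposed in the issue description can be simulated in miniature. The map and marker constant below are illustrative stand-ins, not the metastore's actual TXN_COMPONENTS schema:

```java
import java.util.HashMap;
import java.util.Map;

class TxnComponentsSketch {
    static final String MARKER = "_OPEN_TXN_MARKER_"; // hypothetical marker value

    private final Map<Long, String> entries = new HashMap<>();

    void openTxn(long txnId) {
        // Record the special marker immediately, so an abort before
        // addPartitions never looks like an empty transaction.
        entries.put(txnId, MARKER);
    }

    void addPartitions(long txnId, String partition) {
        // Replace the marker with the real partition entry.
        entries.put(txnId, partition);
    }

    boolean abortedBeforeAddPartitions(long txnId) {
        // The cleaner interprets a surviving marker as "generate worker
        // jobs for every possible partition of this table".
        return MARKER.equals(entries.get(txnId));
    }

    public static void main(String[] args) {
        TxnComponentsSketch txns = new TxnComponentsSketch();
        txns.openTxn(1L);
        System.out.println(txns.abortedBeforeAddPartitions(1L)); // true
        txns.addPartitions(1L, "ds=2020-10-05");
        System.out.println(txns.abortedBeforeAddPartitions(1L)); // false
    }
}
```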
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495226&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495226 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 09:20 Start Date: 05/Oct/20 09:20 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499457494 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactorMR.java ## @@ -237,6 +237,7 @@ void run(HiveConf conf, String jobName, Table t, Partition p, StorageDescriptor } JobConf job = createBaseJobConf(conf, jobName, t, sd, writeIds, ci); +QueryCompactor.Util.removeAbortedDirsForAcidTable(conf, dir); Review comment: @vpnvishv Why do we do this here? I understand we can, but why don't we let the Cleaner to delete the files? This just makes the compactor slower. Do we have a functionality reason for this? After this change it will run in CompactorMR and in MMQueryCompactors, but not in normal QueryCompactors? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495226) Time Spent: 2.5h (was: 2h 20m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495221&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495221 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 09:08 Start Date: 05/Oct/20 09:08 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499449636 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java ## @@ -97,9 +100,9 @@ public void run() { long minOpenTxnId = txnHandler.findMinOpenTxnIdForCleaner(); LOG.info("Cleaning based on min open txn id: " + minOpenTxnId); List cleanerList = new ArrayList<>(); - for(CompactionInfo compactionInfo : txnHandler.findReadyToClean()) { + for (CompactionInfo compactionInfo : txnHandler.findReadyToClean()) { cleanerList.add(CompletableFuture.runAsync(CompactorUtil.ThrowingRunnable.unchecked(() -> -clean(compactionInfo, minOpenTxnId)), cleanerExecutor)); + clean(compactionInfo, minOpenTxnId)), cleanerExecutor)); Review comment: Two questions here: 1. In the original Jira there was discussion about not allowing concurrent cleanings of the same stuff (partition / table). Should we worry about this? 2. The slow cleanAborted will clog the executor service, we should do something about this, either in this patch, or follow up something like https://issues.apache.org/jira/browse/HIVE-21150 immediately after this. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495221) Time Spent: 2h 20m (was: 2h 10m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 2h 20m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
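The reviewer's first point — preventing two cleaner tasks from working on the same table or partition concurrently — could be addressed with a simple per-key guard. A minimal sketch under that assumption (names are hypothetical, not Hive code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CleanerGuard {
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    // Returns true only for the first caller; concurrent cleanings of the
    // same db/table/partition key are refused until finish() is called.
    boolean tryStart(String key) {
        return inFlight.add(key);
    }

    void finish(String key) {
        inFlight.remove(key);
    }

    public static void main(String[] args) {
        CleanerGuard guard = new CleanerGuard();
        System.out.println(guard.tryStart("db.tbl/p=1")); // true
        System.out.println(guard.tryStart("db.tbl/p=1")); // false, already in flight
        guard.finish("db.tbl/p=1");
        System.out.println(guard.tryStart("db.tbl/p=1")); // true again
    }
}
```

The second point (slow aborted-txn cleanings clogging the shared executor) would instead call for a separate executor or priority queue, as the follow-up Jira suggests.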
[jira] [Resolved] (HIVE-24193) Select query on renamed hive acid table does not produce any output
[ https://issues.apache.org/jira/browse/HIVE-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Haindrich resolved HIVE-24193. - Fix Version/s: 4.0.0 Resolution: Fixed merged into master. Thank you [~Rajkumar Singh] for fixing this ; and Peter for reviewing the changes! > Select query on renamed hive acid table does not produce any output > --- > > Key: HIVE-24193 > URL: https://issues.apache.org/jira/browse/HIVE-24193 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 3.1.2 >Reporter: Rajkumar Singh >Assignee: Rajkumar Singh >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > During onRename, HMS update COMPLETED_TXN_COMPONENTS which fail with > CTC_DATABASE column does not exist, upon investigation I found that enclosing > quotes are missing for columns thats db query fail with this exception > Steps to repro: > 1. create table test(id int); > 2. insert into table test values(1); > 3. alter table test rename to test1; > 3. select * from test1 produce no output -- This message was sent by Atlassian Jira (v8.3.4#803005)
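The root cause described above — unquoted column identifiers in the generated metastore SQL — can be illustrated with a simplified sketch. The helper and statement below are hypothetical, not the actual TxnHandler code:

```java
class QuotedRenameSql {
    // Enclose an identifier in ANSI double quotes, as the HMS backing-DB
    // queries must do for columns such as CTC_DATABASE.
    static String q(String ident) {
        return "\"" + ident + "\"";
    }

    // Sketch of a rename update against COMPLETED_TXN_COMPONENTS with
    // every identifier quoted; without q(...) some backing databases
    // reject the statement with "column does not exist".
    static String renameUpdate() {
        return "UPDATE " + q("COMPLETED_TXN_COMPONENTS")
            + " SET " + q("CTC_TABLE") + " = ?"
            + " WHERE " + q("CTC_DATABASE") + " = ? AND " + q("CTC_TABLE") + " = ?";
    }

    public static void main(String[] args) {
        System.out.println(renameUpdate());
    }
}
```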
[jira] [Updated] (HIVE-24193) Select query on renamed hive acid table does not produce any output
[ https://issues.apache.org/jira/browse/HIVE-24193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24193: -- Labels: pull-request-available (was: ) > Select query on renamed hive acid table does not produce any output > --- > > Key: HIVE-24193 > URL: https://issues.apache.org/jira/browse/HIVE-24193 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 3.1.2 >Reporter: Rajkumar Singh >Assignee: Rajkumar Singh >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > During onRename, HMS update COMPLETED_TXN_COMPONENTS which fail with > CTC_DATABASE column does not exist, upon investigation I found that enclosing > quotes are missing for columns thats db query fail with this exception > Steps to repro: > 1. create table test(id int); > 2. insert into table test values(1); > 3. alter table test rename to test1; > 3. select * from test1 produce no output -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24193) Select query on renamed hive acid table does not produce any output
[ https://issues.apache.org/jira/browse/HIVE-24193?focusedWorklogId=495207&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495207 ] ASF GitHub Bot logged work on HIVE-24193: - Author: ASF GitHub Bot Created on: 05/Oct/20 08:02 Start Date: 05/Oct/20 08:02 Worklog Time Spent: 10m Work Description: kgyrtkirk merged pull request #1520: URL: https://github.com/apache/hive/pull/1520 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495207) Remaining Estimate: 0h Time Spent: 10m > Select query on renamed hive acid table does not produce any output > --- > > Key: HIVE-24193 > URL: https://issues.apache.org/jira/browse/HIVE-24193 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 3.1.2 >Reporter: Rajkumar Singh >Assignee: Rajkumar Singh >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > During onRename, HMS update COMPLETED_TXN_COMPONENTS which fail with > CTC_DATABASE column does not exist, upon investigation I found that enclosing > quotes are missing for columns thats db query fail with this exception > Steps to repro: > 1. create table test(id int); > 2. insert into table test values(1); > 3. alter table test rename to test1; > 3. select * from test1 produce no output -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495204&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495204 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 07:56 Start Date: 05/Oct/20 07:56 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499405728 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { + return true; +} else if (name.startsWith(baseDir(wId)) && !name.contains("=")) { + return true; +} + } + return false; +}; +List deleted = new ArrayList<>(); +deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted); +return deleted; + } + + private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List deleted) + throws IOException { +RemoteIterator it = listIterator(fs, root, null); + +while (it.hasNext()) { + FileStatus fStatus = it.next(); + if (fStatus.isDirectory()) { +if (filter.accept(fStatus.getPath())) { + fs.delete(fStatus.getPath(), true); + deleted.add(fStatus); +} else { + 
deleteDeltaDirectoriesAux(fStatus.getPath(), fs, filter, deleted); + if (isDirectoryEmpty(fs, fStatus.getPath())) { +fs.delete(fStatus.getPath(), false); +deleted.add(fStatus); + } +} + } +} + } + + private static boolean isDirectoryEmpty(FileSystem fs, Path path) throws IOException { +RemoteIterator it = listIterator(fs, path, null); +return !it.hasNext(); + } + + private static RemoteIterator listIterator(FileSystem fs, Path path, PathFilter filter) + throws IOException { +try { + return new ToFileStatusIterator(SHIMS.listLocatedHdfsStatusIterator(fs, path, filter)); +} catch (Throwable t) { Review comment: This should be similar to tryListLocatedHdfsStatus don't catch all Throwable. And maybe add all this to the HdfsUtils class This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495204) Time Spent: 2h 10m (was: 2h) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 2h 10m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. 
> This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495203&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495203 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 07:56 Start Date: 05/Oct/20 07:56 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499405728 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { + return true; +} else if (name.startsWith(baseDir(wId)) && !name.contains("=")) { + return true; +} + } + return false; +}; +List deleted = new ArrayList<>(); +deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted); +return deleted; + } + + private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List deleted) + throws IOException { +RemoteIterator it = listIterator(fs, root, null); + +while (it.hasNext()) { + FileStatus fStatus = it.next(); + if (fStatus.isDirectory()) { +if (filter.accept(fStatus.getPath())) { + fs.delete(fStatus.getPath(), true); + deleted.add(fStatus); +} else { + 
deleteDeltaDirectoriesAux(fStatus.getPath(), fs, filter, deleted); + if (isDirectoryEmpty(fs, fStatus.getPath())) { +fs.delete(fStatus.getPath(), false); +deleted.add(fStatus); + } +} + } +} + } + + private static boolean isDirectoryEmpty(FileSystem fs, Path path) throws IOException { +RemoteIterator it = listIterator(fs, path, null); +return !it.hasNext(); + } + + private static RemoteIterator listIterator(FileSystem fs, Path path, PathFilter filter) + throws IOException { +try { + return new ToFileStatusIterator(SHIMS.listLocatedHdfsStatusIterator(fs, path, filter)); +} catch (Throwable t) { Review comment: This should be similar to tryListLocatedHdfsStatus don't catch all Throwable This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495203) Time Spent: 2h (was: 1h 50m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 2h > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. 
> As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495202&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495202 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 07:53 Start Date: 05/Oct/20 07:53 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499404466 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { + return true; +} else if (name.startsWith(baseDir(wId)) && !name.contains("=")) { + return true; +} + } + return false; +}; +List deleted = new ArrayList<>(); +deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted); +return deleted; + } + + private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List deleted) + throws IOException { +RemoteIterator it = listIterator(fs, root, null); + +while (it.hasNext()) { + FileStatus fStatus = it.next(); + if (fStatus.isDirectory()) { +if (filter.accept(fStatus.getPath())) { + fs.delete(fStatus.getPath(), true); + deleted.add(fStatus); +} else { + 
deleteDeltaDirectoriesAux(fStatus.getPath(), fs, filter, deleted); + if (isDirectoryEmpty(fs, fStatus.getPath())) { Review comment: Are we doing this to delete newly created partitions if there are no other writes? Is this ok, what if we found a valid empty partition that is registered in the HMS? We should not delete that. I think this can be skipped all together, the empty partition dir will not bother anybody This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495202) Time Spent: 1h 50m (was: 1h 40m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 1h 50m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. 
> * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495201&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495201 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 07:50 Start Date: 05/Oct/20 07:50 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499402826 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { Review comment: I am wondering are we covering all the use cases here? Is it possible that this dynamic part query was writing to an existing partition with existing older writes and a compaction was running before we managed to delete the aborted delta? I think in this case sadly, we still going to read the aborted data as valid. Could you add a test case to check if it is indeed a problem or not? (I do not have an idea for a solution...) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495201) Time Spent: 1h 40m (was: 1.5h) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 1h 40m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written on the table the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and > when addPartitions is called remove this entry from TXN_COMPONENTS and add > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds and entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495199&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495199 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 07:44 Start Date: 05/Oct/20 07:44 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499399709 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ## @@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map tblProps) { tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES); } + /** + * Look for delta directories matching the list of writeIds and deletes them. + * @param rootPartition root partition to look for the delta directories + * @param conf configuration + * @param writeIds list of writeIds to look for in the delta directories + * @return list of deleted directories. + * @throws IOException + */ + public static List deleteDeltaDirectories(Path rootPartition, Configuration conf, Set writeIds) + throws IOException { +FileSystem fs = rootPartition.getFileSystem(conf); + +PathFilter filter = (p) -> { + String name = p.getName(); + for (Long wId : writeIds) { +if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) { Review comment: Why the contains "=", are we checking for a partition where the user named the column exactly like a valid delta dir? I don't think we should support that This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495199) Time Spent: 1.5h (was: 1h 20m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written to the table, the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, > when addPartitions is called, removing this entry from TXN_COMPONENTS and adding > the corresponding partition entry to TXN_COMPONENTS. > * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted, it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
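The name matching inside the reviewed PathFilter can be isolated from Hadoop and exercised standalone. A minimal sketch, assuming the naming convention: deltaSubdir/baseDir below mimic ACID directory names (delta_<min>_<max>, base_<writeId>), but the 7-digit zero padding is an assumption, not copied from AcidUtils. The contains("=") guard is the one the reviewer questions; it rejects partition directories of the form col=value.

```java
import java.util.Set;

// Standalone sketch of the name matching inside the reviewed PathFilter.
// Directory-name formats are assumptions modeled on ACID naming conventions.
public class DeltaNameFilter {
    static String deltaSubdir(long min, long max) {
        return String.format("delta_%07d_%07d", min, max);
    }

    static String baseDir(long writeId) {
        return String.format("base_%07d", writeId);
    }

    // The contains("=") guard rejects partition directories (col=value), so a
    // partition column named like a delta dir cannot be picked up by accident.
    static boolean matches(String name, Set<Long> writeIds) {
        if (name.contains("=")) {
            return false;
        }
        for (Long wId : writeIds) {
            if (name.startsWith(deltaSubdir(wId, wId)) || name.startsWith(baseDir(wId))) {
                return true;
            }
        }
        return false;
    }
}
```

Pulling the check out this way makes the reviewer's question easy to probe: with the guard in place, a directory named part=delta_0000005_0000005 is never matched, regardless of the writeId set.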
[jira] [Work logged] (HIVE-24228) Support complex types in LLAP
[ https://issues.apache.org/jira/browse/HIVE-24228?focusedWorklogId=495194&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495194 ] ASF GitHub Bot logged work on HIVE-24228: - Author: ASF GitHub Bot Created on: 05/Oct/20 07:26 Start Date: 05/Oct/20 07:26 Worklog Time Spent: 10m Work Description: bymm opened a new pull request #1551: URL: https://github.com/apache/hive/pull/1551 ### What changes were proposed in this pull request? The idea of this improvement is to support complex types (arrays, maps, structs) returned from the LLAP data reader. This is useful when consuming LLAP data later in Spark. ### Why are the changes needed? When consuming data from LLAP, it should support all Hive types, including the complex ones. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It was tested on tables with complex types. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495194) Remaining Estimate: 0h Time Spent: 10m > Support complex types in LLAP > - > > Key: HIVE-24228 > URL: https://issues.apache.org/jira/browse/HIVE-24228 > Project: Hive > Issue Type: Improvement > Components: Hive >Reporter: Yuriy Baltovskyy >Assignee: Yuriy Baltovskyy >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The idea of this improvement is to support complex types (arrays, maps, > structs) returned from the LLAP data reader. This is useful when consuming LLAP > data later in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24228) Support complex types in LLAP
[ https://issues.apache.org/jira/browse/HIVE-24228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24228: -- Labels: pull-request-available (was: ) > Support complex types in LLAP > - > > Key: HIVE-24228 > URL: https://issues.apache.org/jira/browse/HIVE-24228 > Project: Hive > Issue Type: Improvement > Components: Hive >Reporter: Yuriy Baltovskyy >Assignee: Yuriy Baltovskyy >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The idea of this improvement is to support complex types (arrays, maps, > structs) returned from the LLAP data reader. This is useful when consuming LLAP > data later in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-21052) Make sure transactions get cleaned if they are aborted before addPartitions is called
[ https://issues.apache.org/jira/browse/HIVE-21052?focusedWorklogId=495191&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-495191 ] ASF GitHub Bot logged work on HIVE-21052: - Author: ASF GitHub Bot Created on: 05/Oct/20 07:21 Start Date: 05/Oct/20 07:21 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1548: URL: https://github.com/apache/hive/pull/1548#discussion_r499388356 ## File path: ql/src/java/org/apache/hadoop/hive/ql/io/AcidUtils.java ##
@@ -2839,6 +2848,87 @@ public static void setNonTransactional(Map<String, String> tblProps) {
     tblProps.remove(hive_metastoreConstants.TABLE_TRANSACTIONAL_PROPERTIES);
   }
 
+  /**
+   * Look for delta directories matching the list of writeIds and delete them.
+   * @param rootPartition root partition to look for the delta directories
+   * @param conf configuration
+   * @param writeIds list of writeIds to look for in the delta directories
+   * @return list of deleted directories.
+   * @throws IOException
+   */
+  public static List<Path> deleteDeltaDirectories(Path rootPartition, Configuration conf, Set<Long> writeIds)
+      throws IOException {
+    FileSystem fs = rootPartition.getFileSystem(conf);
+
+    PathFilter filter = (p) -> {
+      String name = p.getName();
+      for (Long wId : writeIds) {
+        if (name.startsWith(deltaSubdir(wId, wId)) && !name.contains("=")) {
+          return true;
+        } else if (name.startsWith(baseDir(wId)) && !name.contains("=")) {
+          return true;
+        }
+      }
+      return false;
+    };
+    List<Path> deleted = new ArrayList<>();
+    deleteDeltaDirectoriesAux(rootPartition, fs, filter, deleted);
+    return deleted;
+  }
+
+  private static void deleteDeltaDirectoriesAux(Path root, FileSystem fs, PathFilter filter, List<Path> deleted)
Review comment: This is going to issue many filesystem listings on a table with many partitions, which is going to be very slow on S3.
I think you should consider changing this logic to be similar to getHdfsDirSnapshots: do one recursive listing, iterate all the files, collect the deltas that need to be deleted, and delete them at the end (possibly concurrently). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 495191) Time Spent: 1h 20m (was: 1h 10m) > Make sure transactions get cleaned if they are aborted before addPartitions > is called > - > > Key: HIVE-21052 > URL: https://issues.apache.org/jira/browse/HIVE-21052 > Project: Hive > Issue Type: Bug > Components: Transactions >Affects Versions: 3.0.0, 3.1.1 >Reporter: Jaume M >Assignee: Jaume M >Priority: Critical > Labels: pull-request-available > Attachments: Aborted Txn w_Direct Write.pdf, HIVE-21052.1.patch, > HIVE-21052.10.patch, HIVE-21052.11.patch, HIVE-21052.12.patch, > HIVE-21052.2.patch, HIVE-21052.3.patch, HIVE-21052.4.patch, > HIVE-21052.5.patch, HIVE-21052.6.patch, HIVE-21052.7.patch, > HIVE-21052.8.patch, HIVE-21052.9.patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > If the transaction is aborted between openTxn and addPartitions and data has > been written to the table, the transaction manager will think it's an empty > transaction and no cleaning will be done. > This is currently an issue in the streaming API and in micromanaged tables. > As proposed by [~ekoifman] this can be solved by: > * Writing an entry with a special marker to TXN_COMPONENTS at openTxn and, > when addPartitions is called, removing this entry from TXN_COMPONENTS and adding > the corresponding partition entry to TXN_COMPONENTS.
> * If the cleaner finds an entry with a special marker in TXN_COMPONENTS that > specifies that a transaction was opened and it was aborted, it must generate > jobs for the worker for every possible partition available. > cc [~ewohlstadter] -- This message was sent by Atlassian Jira (v8.3.4#803005)
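The reviewer's single-traversal suggestion can be sketched against the local filesystem with java.nio. This is an assumption-laden sketch: the class and method names are hypothetical, getHdfsDirSnapshots is only referenced rather than reproduced, and real Hive code would traverse org.apache.hadoop.fs paths (e.g. a recursive FileSystem listing) instead of java.nio.file.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Local-filesystem sketch of the review suggestion: one recursive traversal,
// collect every matching delta/base dir, delete afterwards. Names and the
// zero-padded directory formats are illustrative assumptions.
public class RecursiveDeltaDelete {

    static List<Path> collectDeltaDirs(Path root, Set<Long> writeIds) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) { // single recursive listing
            return walk.filter(Files::isDirectory)
                       .filter(p -> matches(p.getFileName().toString(), writeIds))
                       .collect(Collectors.toList());
        }
    }

    static boolean matches(String name, Set<Long> writeIds) {
        if (name.contains("=")) {  // skip partition dirs (col=value)
            return false;
        }
        return writeIds.stream().anyMatch(w ->
                name.startsWith(String.format("delta_%07d_%07d", w, w))
             || name.startsWith(String.format("base_%07d", w)));
    }
}
```

Collecting first and deleting at the end keeps the filesystem traffic to one pass over the tree and leaves room to issue the deletes concurrently, which is the property the reviewer is after on S3.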