[jira] [Comment Edited] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location
[ https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524983#comment-17524983 ]

tanghui edited comment on HIVE-24920 at 4/21/22 12:41 AM:
----------------------------------------------------------

After applying the patch, the partitioned table's location and its HDFS data directory are displayed correctly, but the partition locations recorded in the SDS table of the Hive metastore database still point at the old table location, so querying a partition returns no data.

In beeline:
{code:sql}
set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
  c1 string,
  c2 string
) PARTITIONED BY (dat string);

insert into part_test values ("11","th","20220101");
insert into part_test values ("22","th","20220102");

alter table part_test rename to part_test11;

-- this result is empty:
select * from part_test11 where dat="20220101";
{code}
||part_test.c1||part_test.c2||part_test.dat||
| | | |

SDS in the Hive metastore database:
{code:sql}
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID;
{code}
|*LOCATION*|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

We need to update the partition locations of the table in SDS so that query results are correct: https://issues.apache.org/jira/browse/HIVE-26158

was (Author: sanguines): (the same comment as above, minus the trailing link to HIVE-26158)

> TRANSLATED_TO_EXTERNAL tables may write to the same location
> -------------------------------------------------------------
>
>                 Key: HIVE-24920
>                 URL: https://issues.apache.org/jira/browse/HIVE-24920
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>              Labels: metastore_translator, pull-request-available
>             Fix For: 4.0.0, 4.0.0-alpha-1
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> {code}
> create table t (a integer);
> insert into t values(1);
> alter table t rename to t2;
> create table t (a integer); -- I expected an exception from this command (location already exists) but because it's an external table there is no exception
> insert into t values(2);
> select * from t; -- shows 1 and 2
> drop table t2;   -- wipes out the data location
> select * from t; -- empty result set
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
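The follow-up fix has to rewrite each partition's SDS location so that its prefix points at the renamed table's directory. A minimal sketch of that prefix rewrite in plain Python (the function name and logic are illustrative, not actual Hive metastore code):

```python
def rewrite_partition_location(location: str, old_table_dir: str, new_table_dir: str) -> str:
    """Rewrite a location that still points under the old table directory;
    leave any other location untouched."""
    old_base = old_table_dir.rstrip("/")
    new_base = new_table_dir.rstrip("/")
    if location == old_base or location.startswith(old_base + "/"):
        return new_base + location[len(old_base):]
    return location

old = "hdfs://nameservice1/warehouse/tablespace/external/hive/part_test"
new = "hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11"
print(rewrite_partition_location(old + "/dat=20220101", old, new))
# -> hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11/dat=20220101
```

Note the prefix check requires the trailing "/" so that a table named `part_test1` would not be mistaken for a partition of `part_test`.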
[jira] [Work logged] (HIVE-24969) Predicates may be removed when decorrelating subqueries with lateral
[ https://issues.apache.org/jira/browse/HIVE-24969?focusedWorklogId=759665&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759665 ]

ASF GitHub Bot logged work on HIVE-24969:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 21/Apr/22 00:21
            Start Date: 21/Apr/22 00:21
    Worklog Time Spent: 10m
      Work Description: github-actions[bot] closed pull request #3018: HIVE-24969: Predicates may be removed when decorrelating subqueries with lateral
URL: https://github.com/apache/hive/pull/3018

Issue Time Tracking
-------------------
    Worklog Id: (was: 759665)
    Time Spent: 3h (was: 2h 50m)

> Predicates may be removed when decorrelating subqueries with lateral
> ---------------------------------------------------------------------
>
>                 Key: HIVE-24969
>                 URL: https://issues.apache.org/jira/browse/HIVE-24969
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>            Reporter: Zhihua Deng
>            Assignee: Zhihua Deng
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> Steps to reproduce:
> {code:java}
> select count(distinct logItem.triggerId)
> from service_stat_log LATERAL VIEW explode(logItems) LogItemTable AS logItem
> where logItem.dsp in ('delivery', 'ocpa')
> and logItem.iswin = true
> and logItem.adid in (
>     select distinct adId
>     from ad_info
>     where subAccountId in (16010, 14863)); {code}
> The predicates _logItem.dsp in ('delivery', 'ocpa')_ and _logItem.iswin = true_ are removed when doing predicate pushdown: JOIN -> RS -> LVJ. The JOIN has candidates: logitem -> [logItem.dsp in ('delivery', 'ocpa'), logItem.iswin = true]. When pushing them to the RS followed by the LVJ, none of them are pushed, and the candidates of logitem are finally removed by default, which leads to a wrong result.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
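The safe behavior the report calls for can be stated abstractly: when predicates cannot be pushed past an operator (here the ReduceSink feeding the LateralViewJoin), they must be retained at the current operator rather than silently discarded. A toy Python model of that split (not Hive's actual PPD implementation):

```python
def push_predicates(candidates, can_push):
    """Split predicates into those pushed below an operator and those that
    must stay at the current operator. Dropping the un-pushable ones (the
    bug described above) silently widens the result set."""
    pushed = [p for p in candidates if can_push(p)]
    retained = [p for p in candidates if not can_push(p)]
    return pushed, retained

preds = ["dsp in ('delivery','ocpa')", "iswin = true"]
# Nothing is pushable past the RS -> LVJ edge in the reported plan:
pushed, retained = push_predicates(preds, lambda p: False)
print(retained)  # correct behavior keeps both; the bug removed them
```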
[jira] [Work logged] (HIVE-26072) Enable vectorization for stats gathering (tablescan op)
[ https://issues.apache.org/jira/browse/HIVE-26072?focusedWorklogId=759621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759621 ]

ASF GitHub Bot logged work on HIVE-26072:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 20/Apr/22 22:12
            Start Date: 20/Apr/22 22:12
    Worklog Time Spent: 10m
      Work Description: ramesh0201 opened a new pull request, #3228:
URL: https://github.com/apache/hive/pull/3228
   …ROugh patch(Do Not merge)
   ### What changes were proposed in this pull request?
   ### Why are the changes needed?
   ### Does this PR introduce _any_ user-facing change?
   ### How was this patch tested?

Issue Time Tracking
-------------------
    Worklog Id: (was: 759621)
    Time Spent: 0.5h (was: 20m)

> Enable vectorization for stats gathering (tablescan op)
> --------------------------------------------------------
>
>                 Key: HIVE-26072
>                 URL: https://issues.apache.org/jira/browse/HIVE-26072
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Rajesh Balamohan
>            Assignee: Ayush Saxena
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/HIVE-24510 enabled vectorization for compute_bit_vector.
> But the tablescan operator for stats gathering is disabled by default:
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java#L2577
> We need to enable vectorization for this. It can significantly reduce runtimes of analyze statements on large tables.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
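For intuition on why vectorizing the stats-gathering table scan helps: statistics such as row count, null count, min and max can be computed in one pass over a whole column batch instead of through per-row operator calls. A plain-Python stand-in (illustrative only; Hive's vectorized row batches and stats aggregators are quite different):

```python
def batch_column_stats(batch):
    """Compute simple column statistics over one batch of values,
    the way a vectorized operator would process a whole column at once."""
    non_null = [v for v in batch if v is not None]
    return {
        "count": len(batch),
        "num_nulls": len(batch) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

print(batch_column_stats([3, None, 7, 1]))
# -> {'count': 4, 'num_nulls': 1, 'min': 1, 'max': 7}
```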
[jira] [Work logged] (HIVE-21456) Hive Metastore Thrift over HTTP
[ https://issues.apache.org/jira/browse/HIVE-21456?focusedWorklogId=759515&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759515 ]

ASF GitHub Bot logged work on HIVE-21456:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 20/Apr/22 19:33
            Start Date: 20/Apr/22 19:33
    Worklog Time Spent: 10m
      Work Description: sourabh912 commented on PR #3105:
URL: https://github.com/apache/hive/pull/3105#issuecomment-1104380804
   Thanks for the review @pvary @nrg4878 @yongzhi and @saihemanth-cloudera

Issue Time Tracking
-------------------
    Worklog Id: (was: 759515)
    Time Spent: 6h 50m (was: 6h 40m)

> Hive Metastore Thrift over HTTP
> -------------------------------
>
>                 Key: HIVE-21456
>                 URL: https://issues.apache.org/jira/browse/HIVE-21456
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore, Standalone Metastore
>            Reporter: Amit Khanna
>            Assignee: Sourabh Goyal
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-21456.2.patch, HIVE-21456.3.patch, HIVE-21456.4.patch, HIVE-21456.patch
>
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Hive Metastore currently doesn't have support for HTTP transport, because of which it is not possible to access it via Knox. Adding support for Thrift over HTTP transport will allow clients to access the metastore via Knox.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
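The point of an HTTP transport is that each Thrift message rides in an ordinary HTTP POST body, so an HTTP-aware gateway such as Knox can route and authenticate the traffic like any other web request. A self-contained Python sketch of that request/response shape, using a toy echo server as a stand-in for the metastore endpoint (the path and payload bytes are made up; real Thrift-over-HTTP uses the Thrift library's HTTP transport):

```python
import http.client
import http.server
import threading

class EchoThriftHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a Thrift-over-HTTP endpoint: it reads one opaque,
    serialized message per POST and writes one response (here, an echo)."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        self.send_response(200)
        self.send_header("Content-Type", "application/x-thrift")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging for the demo
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), EchoThriftHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: an already-serialized message travels as a POST body,
# which is exactly what lets an HTTP proxy sit in the middle.
payload = b"\x80\x01\x00\x01 (stand-in for a serialized Thrift call)"
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
conn.request("POST", "/metastore", payload,
             {"Content-Type": "application/x-thrift"})
resp = conn.getresponse()
echoed = resp.read()
server.shutdown()
print(resp.status, echoed == payload)  # 200 True
```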
[jira] [Resolved] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar
[ https://issues.apache.org/jira/browse/HIVE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena resolved HIVE-26074.
---------------------------------
    Fix Version/s: 4.0.0-alpha-2
         Assignee: Ayush Saxena  (was: László Bodor)
       Resolution: Fixed

> PTF Vectorization: BoundaryScanner for varchar
> -----------------------------------------------
>
>                 Key: HIVE-26074
>                 URL: https://issues.apache.org/jira/browse/HIVE-26074
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: Ayush Saxena
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0-alpha-2
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on the varchar type:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: attempt to setup a Window for typeString: 'varchar(170)'
> 	at org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
> 	at org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.<init>(ValueBoundaryScanner.java:1257)
> 	at org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
> 	at org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
> 	at org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
> 	at org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
> 	at org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
> 	at org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
> 	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
> 	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
> 	... 16 more
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
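The failure mode in the stack trace is a type-dispatch table with no entry for varchar, so getBoundaryScanner throws an internal error. A toy Python model of that dispatch and of the shape of the fix (the class names and dispatch table here are hypothetical, not Hive's actual code):

```python
def base_type(type_string: str) -> str:
    # "varchar(170)" -> "varchar"; "string" -> "string"
    return type_string.split("(", 1)[0].strip().lower()

# Hypothetical dispatch table: before the fix there is no "varchar" entry.
SCANNERS = {
    "string": "StringValueBoundaryScanner",
    "char": "StringValueBoundaryScanner",
}

def get_boundary_scanner(type_string: str) -> str:
    t = base_type(type_string)
    if t not in SCANNERS:
        raise RuntimeError("Internal Error: attempt to setup a Window "
                           f"for typeString: '{type_string}'")
    return SCANNERS[t]

# The fix, conceptually: treat varchar like the other string-family types.
SCANNERS["varchar"] = "StringValueBoundaryScanner"
print(get_boundary_scanner("varchar(170)"))  # StringValueBoundaryScanner
```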
[jira] [Commented] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar
[ https://issues.apache.org/jira/browse/HIVE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525179#comment-17525179 ] Ayush Saxena commented on HIVE-26074: - Merged PR to master. Thanx [~abstractdog] for the review!!! > PTF Vectorization: BoundaryScanner for varchar > -- > > Key: HIVE-26074 > URL: https://issues.apache.org/jira/browse/HIVE-26074 > Project: Hive > Issue Type: Bug >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > HIVE-24761 should be extended for varchar, otherwise it fails on varchar type > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: > attempt to setup a Window for typeString: 'varchar(170)' > at > org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773) > at > org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner. (ValueBoundaryScanner.java:1257) > at > org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237) > at > org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327) > at > org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40) > at > org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442) > at > org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631) > at > org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782) > at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731) > at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383) > ... 16 more > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar
[ https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759469&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759469 ]

ASF GitHub Bot logged work on HIVE-26074:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 20/Apr/22 18:25
            Start Date: 20/Apr/22 18:25
    Worklog Time Spent: 10m
      Work Description: ayushtkn merged PR #3187:
URL: https://github.com/apache/hive/pull/3187

Issue Time Tracking
-------------------
    Worklog Id: (was: 759469)
    Time Spent: 1.5h (was: 1h 20m)

> PTF Vectorization: BoundaryScanner for varchar
> -----------------------------------------------
>
>                 Key: HIVE-26074
>                 URL: https://issues.apache.org/jira/browse/HIVE-26074
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on the varchar type:
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: attempt to setup a Window for typeString: 'varchar(170)'
> 	at org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
> 	at org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.<init>(ValueBoundaryScanner.java:1257)
> 	at org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
> 	at org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
> 	at org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
> 	at org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
> 	at org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
> 	at org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
> 	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
> 	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
> 	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
> 	... 16 more
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Updated] (HIVE-26159) hive cli is unavailable from hive command
[ https://issues.apache.org/jira/browse/HIVE-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-26159:
----------------------------------
    Labels: pull-request-available  (was: )

> hive cli is unavailable from hive command
> ------------------------------------------
>
>                 Key: HIVE-26159
>                 URL: https://issues.apache.org/jira/browse/HIVE-26159
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Wechar
>            Assignee: Wechar
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive cli is a convenient tool to connect to the hive metastore service, but now hive cli cannot start even if we use the *--service cli* option; it appears to be a bug introduced by HIVE-24348.
> *Steps to reproduce:*
> {code:bash}
> hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf hive.metastore.uris=thrift://hive:9084
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
> beeline>
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Work logged] (HIVE-26159) hive cli is unavailable from hive command
[ https://issues.apache.org/jira/browse/HIVE-26159?focusedWorklogId=759361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759361 ]

ASF GitHub Bot logged work on HIVE-26159:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 20/Apr/22 16:18
            Start Date: 20/Apr/22 16:18
    Worklog Time Spent: 10m
      Work Description: wecharyu opened a new pull request, #3227:
URL: https://github.com/apache/hive/pull/3227

   ### What changes were proposed in this pull request?
   Hive cli is the default service in the hive script, but it cannot start even when the `--service cli` option is used now; falling back to beeline is not what users expect.

   ### Why are the changes needed?
   Hive cli is a convenient tool to connect to the hive metastore service for testing or trying Hive out, and the Hive community does not seem to intend to deprecate hive cli so far.

   ### Does this PR introduce _any_ user-facing change?
   No

   ### How was this patch tested?
   Shell script; it can be tested locally with just the `hive` command:
   ```bash
   $ $HIVE_HOME/bin/hive
   ```

Issue Time Tracking
-------------------
        Worklog Id: (was: 759361)
Remaining Estimate: 0h
        Time Spent: 10m

> hive cli is unavailable from hive command
> ------------------------------------------
>
>                 Key: HIVE-26159
>                 URL: https://issues.apache.org/jira/browse/HIVE-26159
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Wechar
>            Assignee: Wechar
>            Priority: Major
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive cli is a convenient tool to connect to the hive metastore service, but now hive cli cannot start even if we use the *--service cli* option; it appears to be a bug introduced by HIVE-24348.
> *Steps to reproduce:*
> {code:bash}
> hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf hive.metastore.uris=thrift://hive:9084
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
> beeline>
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
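Conceptually, the fix concerns how the bin/hive launcher chooses what to start: the `--service` argument should be honored, with `cli` as the default, instead of unconditionally launching beeline. A hypothetical Python model of that dispatch decision (the real logic lives in a shell script; this only illustrates the selection rule):

```python
def pick_service(argv):
    """Return the service the launcher should start: honor a leading
    --service option, otherwise default to "cli"."""
    it = iter(argv)
    for arg in it:
        if arg == "--service":
            return next(it, "cli")
        break  # any other leading option leaves the default in place
    return "cli"

print(pick_service(["--service", "metastore"]))  # metastore
print(pick_service(["--hiveconf", "a=b"]))       # cli
```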
[jira] [Updated] (HIVE-26159) hive cli is unavailable from hive command
[ https://issues.apache.org/jira/browse/HIVE-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wechar updated HIVE-26159:
--------------------------
    Description:
Hive cli is a convenient tool to connect to the hive metastore service, but now hive cli cannot start even if we use the *--service cli* option; it appears to be a bug introduced by HIVE-24348.

*Steps to reproduce:*
{code:bash}
hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf hive.metastore.uris=thrift://hive:9084
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
beeline>
{code}

  was: (the same description, except that HIVE-24348 was written as the explicit link [HIVE-24348|https://issues.apache.org/jira/browse/HIVE-24348] and the heading read *Step to reproduce:*)

> hive cli is unavailable from hive command
> ------------------------------------------
>
>                 Key: HIVE-26159
>                 URL: https://issues.apache.org/jira/browse/HIVE-26159
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: Wechar
>            Assignee: Wechar
>            Priority: Major
>             Fix For: 4.0.0
>
> Hive cli is a convenient tool to connect to the hive metastore service, but now hive cli cannot start even if we use the *--service cli* option; it appears to be a bug introduced by HIVE-24348.
> *Steps to reproduce:*
> {code:bash}
> hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf hive.metastore.uris=thrift://hive:9084
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
> beeline>
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Assigned] (HIVE-26159) hive cli is unavailable from hive command
[ https://issues.apache.org/jira/browse/HIVE-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wechar reassigned HIVE-26159: - > hive cli is unavailable from hive command > - > > Key: HIVE-26159 > URL: https://issues.apache.org/jira/browse/HIVE-26159 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0-alpha-1 >Reporter: Wechar >Assignee: Wechar >Priority: Major > Fix For: 4.0.0 > > > Hive cli is a convenient tool to connect to hive metastore service, but now > hive cli can not start even if we use *--service cli* option, it should be a > bug of ticket [HIVE-24348|https://issues.apache.org/jira/browse/HIVE-24348]. > *Step to reproduce:* > {code:bash} > hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf > hive.metastore.uris=thrift://hive:9084 > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] > Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive > beeline> > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution
[ https://issues.apache.org/jira/browse/HIVE-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525084#comment-17525084 ] Ádám Szita commented on HIVE-26137: --- Note: this patch also reverts the hacky solution of HIVE-25967 > Optimized transfer of Iceberg residual expressions from AM to execution > --- > > Key: HIVE-26137 > URL: https://issues.apache.org/jira/browse/HIVE-26137 > Project: Hive > Issue Type: Improvement >Reporter: Ádám Szita >Assignee: Ádám Szita >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > HIVE-25967 introduced a hack to prevent Iceberg filter expressions to be > serialized into splits. This temporary fix was to avoid OOM problems on Tez > AM side, but at the same time prevented predicate pushdowns to work on the > execution side too. > This ticket intends to incorporate the long term solution. It turns out that > the file scan tasks created by Iceberg actually don't contain a "residual" > expressions, but rather a complete/original one. It becomes residual only > when it is evaluated against the tasks' partition value, which only happens > on the execution site. This means that the original filter is the same > expression for all splits in Tez AM, so we can transfer it via job conf > instead. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution
[ https://issues.apache.org/jira/browse/HIVE-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ádám Szita resolved HIVE-26137.
-------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Committed to master. Thanks for the review [~mbod]

> Optimized transfer of Iceberg residual expressions from AM to execution
> ------------------------------------------------------------------------
>
>                 Key: HIVE-26137
>                 URL: https://issues.apache.org/jira/browse/HIVE-26137
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> HIVE-25967 introduced a hack to prevent Iceberg filter expressions from being serialized into splits. This temporary fix avoided OOM problems on the Tez AM side, but at the same time prevented predicate pushdown from working on the execution side too.
> This ticket intends to implement the long-term solution. It turns out that the file scan tasks created by Iceberg don't actually contain "residual" expressions, but rather the complete/original one. It becomes residual only when it is evaluated against the task's partition value, which only happens on the execution side. This means that the original filter is the same expression for all splits in the Tez AM, so we can transfer it via the job conf instead.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
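The described change can be sketched abstractly: because the planned filter is identical for every split, it can be stored once in the job configuration instead of being serialized into each split, and each task derives its residual at execution time. A toy Python model (the conf key and data structures are made up, not the Iceberg or Hive API):

```python
def plan_splits(files, filter_expr):
    """AM side: record the shared filter once in the job conf; splits
    carry no per-split copy of the expression (the source of the OOMs)."""
    conf = {"iceberg.filter.expression": filter_expr}  # hypothetical key
    splits = [{"file": f} for f in files]
    return conf, splits

def residual_for(split, conf, partition_value):
    """Execution side: recover the full filter from the conf; real Iceberg
    would evaluate it against the task's partition value to obtain the
    residual, which we skip here."""
    return conf["iceberg.filter.expression"]

conf, splits = plan_splits(["a.parquet", "b.parquet"], "col > 5")
print(residual_for(splits[0], conf, partition_value=None))  # col > 5
```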
[jira] [Work logged] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution
[ https://issues.apache.org/jira/browse/HIVE-26137?focusedWorklogId=759329=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759329 ] ASF GitHub Bot logged work on HIVE-26137: - Author: ASF GitHub Bot Created on: 20/Apr/22 15:46 Start Date: 20/Apr/22 15:46 Worklog Time Spent: 10m Work Description: szlta merged PR #3203: URL: https://github.com/apache/hive/pull/3203 Issue Time Tracking --- Worklog Id: (was: 759329) Time Spent: 40m (was: 0.5h) > Optimized transfer of Iceberg residual expressions from AM to execution > --- > > Key: HIVE-26137 > URL: https://issues.apache.org/jira/browse/HIVE-26137 > Project: Hive > Issue Type: Improvement >Reporter: Ádám Szita >Assignee: Ádám Szita >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > HIVE-25967 introduced a hack to prevent Iceberg filter expressions to be > serialized into splits. This temporary fix was to avoid OOM problems on Tez > AM side, but at the same time prevented predicate pushdowns to work on the > execution side too. > This ticket intends to incorporate the long term solution. It turns out that > the file scan tasks created by Iceberg actually don't contain a "residual" > expressions, but rather a complete/original one. It becomes residual only > when it is evaluated against the tasks' partition value, which only happens > on the execution site. This means that the original filter is the same > expression for all splits in Tez AM, so we can transfer it via job conf > instead. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution
[ https://issues.apache.org/jira/browse/HIVE-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ádám Szita reassigned HIVE-26137: - Assignee: Ádám Szita > Optimized transfer of Iceberg residual expressions from AM to execution > --- > > Key: HIVE-26137 > URL: https://issues.apache.org/jira/browse/HIVE-26137 > Project: Hive > Issue Type: Improvement >Reporter: Ádám Szita >Assignee: Ádám Szita >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > HIVE-25967 introduced a hack to prevent Iceberg filter expressions to be > serialized into splits. This temporary fix was to avoid OOM problems on Tez > AM side, but at the same time prevented predicate pushdowns to work on the > execution side too. > This ticket intends to incorporate the long term solution. It turns out that > the file scan tasks created by Iceberg actually don't contain a "residual" > expressions, but rather a complete/original one. It becomes residual only > when it is evaluated against the tasks' partition value, which only happens > on the execution site. This means that the original filter is the same > expression for all splits in Tez AM, so we can transfer it via job conf > instead. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759316=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759316 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 15:26 Start Date: 20/Apr/22 15:26 Worklog Time Spent: 10m Work Description: marton-bod commented on PR #3222: URL: https://github.com/apache/hive/pull/3222#issuecomment-1104065890 The syntax should be marked experimental/alpha with evolving-semantics here. The problem is that FROM ... TO has a different semantics in standard SQL, it returns all rows that were active during the time window (so it can contain rows that were inserted before the time window), instead of just the appends that happened during the window. Let's discuss it tomorrow, but we might want to come up with a way to document it that this alpha/experimental at this stage. Issue Time Tracking --- Worklog Id: (was: 759316) Time Spent: 1h 40m (was: 1.5h) > Support range-based time travel queries for Iceberg > --- > > Key: HIVE-26151 > URL: https://issues.apache.org/jira/browse/HIVE-26151 > Project: Hive > Issue Type: New Feature >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > Allow querying which records have been inserted during a certain time window > for Iceberg tables. The Iceberg TableScan API provides an implementation for > that, so most of the work would go into adding syntax support and > transporting the startTime and endTime parameters to the Iceberg input format. > Proposed new syntax: > SELECT * FROM table FOR SYSTEM_TIME FROM '' TO '' > SELECT * FROM table FOR SYSTEM_VERSION FROM TO > (the TO clause is optional in both cases) -- This message was sent by Atlassian Jira (v8.20.7#820007)
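The semantic distinction raised in the review can be illustrated with a small sketch (all types and boundary choices here are made up for illustration, not Iceberg's API): the proposed `FOR SYSTEM_TIME FROM ... TO ...` returns only the records appended by snapshots committed inside the window, whereas standard SQL temporal semantics would return every row that was active during the window, including rows inserted before it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of append-only window semantics: only rows added by snapshots whose
// commit time falls in (from, to] are returned; a row committed before the
// window is excluded even though it is still "active" during the window.
public class TimeWindowSketch {

    static class Snapshot {
        final long commitMillis;
        final List<String> appendedRows;
        Snapshot(long commitMillis, List<String> appendedRows) {
            this.commitMillis = commitMillis;
            this.appendedRows = appendedRows;
        }
    }

    static List<String> appendsBetween(List<Snapshot> snapshots, long fromMillis, long toMillis) {
        List<String> result = new ArrayList<>();
        for (Snapshot s : snapshots) {
            // exclusive-from / inclusive-to is an assumption of this sketch
            if (s.commitMillis > fromMillis && s.commitMillis <= toMillis) {
                result.addAll(s.appendedRows);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Snapshot> history = List.of(
            new Snapshot(100, List.of("r1")),   // before the window: excluded, even if still active
            new Snapshot(200, List.of("r2")),
            new Snapshot(300, List.of("r3")));  // after the window: excluded
        System.out.println(appendsBetween(history, 150, 250)); // only r2
    }
}
```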
[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759306=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759306 ] ASF GitHub Bot logged work on HIVE-26157: - Author: ASF GitHub Bot Created on: 20/Apr/22 15:18 Start Date: 20/Apr/22 15:18 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3226: URL: https://github.com/apache/hive/pull/3226#discussion_r854264528 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ## @@ -104,7 +106,7 @@ public class HiveIcebergStorageHandler implements HiveStoragePredicateHandler, HiveStorageHandler { private static final Logger LOG = LoggerFactory.getLogger(HiveIcebergStorageHandler.class); - private static final String ICEBERG_URI_PREFIX = "iceberg://"; + private static final String ICEBERG_URI_PREFIX = "iceberg:/"; Review Comment: Was this change necessary? I think it's supposed to work like a 'protocol' that's why there is the double-slash. I think kafka and druid did the same thing (i.e. `druid://` and `kafka://`) Issue Time Tracking --- Worklog Id: (was: 759306) Time Spent: 1h (was: 50m) > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data via querying table A. > {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759265=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759265 ] ASF GitHub Bot logged work on HIVE-26157: - Author: ASF GitHub Bot Created on: 20/Apr/22 14:29 Start Date: 20/Apr/22 14:29 Worklog Time Spent: 10m Work Description: szlta commented on code in PR #3226: URL: https://github.com/apache/hive/pull/3226#discussion_r854208599 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java: ## @@ -43,6 +45,26 @@ private IcebergTableUtil() { } + /** + * Constructs the table properties needed for the Iceberg table loading by retrieving the information from the + * hmsTable. It then calls {@link IcebergTableUtil#getTable(Configuration, Properties)} with these properties. + * @param configuration a Hadoop configuration + * @param hmsTable the HMS table + * @return the Iceberg table + */ + static Table getTable(Configuration configuration, org.apache.hadoop.hive.metastore.api.Table hmsTable) { +Properties properties = new Properties(); +properties.setProperty(Catalogs.NAME, TableIdentifier.of(hmsTable.getDbName(), hmsTable.getTableName()).toString()); +if (hmsTable.getSd() != null) { + properties.setProperty(Catalogs.LOCATION, hmsTable.getSd().getLocation()); +} +if (hmsTable.getParameters().containsKey(InputFormatConfig.CATALOG_NAME)) { + properties.setProperty( + InputFormatConfig.CATALOG_NAME, hmsTable.getParameters().get(InputFormatConfig.CATALOG_NAME)); +} Review Comment: nit: could do with one look-up only by calling get() beforehand, and refactoring the if condition into null check. 
Issue Time Tracking --- Worklog Id: (was: 759265) Time Spent: 50m (was: 40m) > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data via querying table A. > {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
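The reviewer's nit above can be sketched like this (a minimal stand-alone illustration; `CATALOG_NAME` is a stand-in for `InputFormatConfig.CATALOG_NAME` and the method is simplified): replace the `containsKey()` + `get()` pair with a single `get()` and a null check.

```java
import java.util.Map;
import java.util.Properties;

// Sketch of the suggested refactor: one map look-up instead of two.
public class SingleLookupSketch {
    static final String CATALOG_NAME = "iceberg.catalog";  // illustrative key

    static Properties toProperties(Map<String, String> tableParameters) {
        Properties properties = new Properties();
        String catalogName = tableParameters.get(CATALOG_NAME);  // single look-up
        if (catalogName != null) {                               // null check replaces containsKey()
            properties.setProperty(CATALOG_NAME, catalogName);
        }
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(toProperties(Map.of(CATALOG_NAME, "hive")));
    }
}
```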
[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759255=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759255 ] ASF GitHub Bot logged work on HIVE-26157: - Author: ASF GitHub Bot Created on: 20/Apr/22 14:18 Start Date: 20/Apr/22 14:18 Worklog Time Spent: 10m Work Description: szlta commented on code in PR #3226: URL: https://github.com/apache/hive/pull/3226#discussion_r854194435 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ## @@ -450,7 +452,9 @@ public boolean isValidMetadataTable(String metaTableName) { public URI getURIForAuth(org.apache.hadoop.hive.metastore.api.Table hmsTable) throws URISyntaxException { String dbName = hmsTable.getDbName(); String tableName = hmsTable.getTableName(); -return new URI(ICEBERG_URI_PREFIX + dbName + "/" + tableName); +Table table = IcebergTableUtil.getTable(conf, hmsTable); Review Comment: Giving back `hmsTable.getSd().location() + "/metadata"` seems reasonable to me in such cases Issue Time Tracking --- Worklog Id: (was: 759255) Time Spent: 40m (was: 0.5h) > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data via querying table A. > {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
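The fallback discussed in this thread can be sketched as below (method and parameter names are illustrative, not the actual patch): when the Iceberg table cannot be loaded yet, e.g. while authorizing a CREATE TABLE, derive the authz location from the HMS storage descriptor plus `/metadata` instead of dereferencing a table object that would be null.

```java
// Sketch: null-safe choice of the authorization location.
public class AuthzUriSketch {

    static String authzLocation(String metadataLocation, String sdLocation) {
        // metadataLocation is null before the Iceberg table exists (CREATE TABLE authz);
        // fall back to the storage-descriptor location as suggested in the review.
        return metadataLocation != null ? metadataLocation : sdLocation + "/metadata";
    }

    public static void main(String[] args) {
        System.out.println(authzLocation(null, "hdfs://nn/warehouse/db/tbl"));
    }
}
```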
[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759247=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759247 ] ASF GitHub Bot logged work on HIVE-26157: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:50 Start Date: 20/Apr/22 13:50 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3226: URL: https://github.com/apache/hive/pull/3226#discussion_r854160373 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ## @@ -450,7 +452,9 @@ public boolean isValidMetadataTable(String metaTableName) { public URI getURIForAuth(org.apache.hadoop.hive.metastore.api.Table hmsTable) throws URISyntaxException { String dbName = hmsTable.getDbName(); String tableName = hmsTable.getTableName(); -return new URI(ICEBERG_URI_PREFIX + dbName + "/" + tableName); +Table table = IcebergTableUtil.getTable(conf, hmsTable); Review Comment: We ran into problems with the approach of loading the Iceberg table here before. The problem is that this method can be called to authorize CREATE TABLE commands as well, at which point the iceberg table does not exist yet, so this will lead to NPE. In that case, when the table object is null, then maybe we fall back to using the `hmsTable.getSd().location() + "/metadata"`? I'm not sure though, just thinking out loud Issue Time Tracking --- Worklog Id: (was: 759247) Time Spent: 0.5h (was: 20m) > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data via querying table A. 
> {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759246=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759246 ] ASF GitHub Bot logged work on HIVE-26157: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:50 Start Date: 20/Apr/22 13:50 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3226: URL: https://github.com/apache/hive/pull/3226#discussion_r854160373 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ## @@ -450,7 +452,9 @@ public boolean isValidMetadataTable(String metaTableName) { public URI getURIForAuth(org.apache.hadoop.hive.metastore.api.Table hmsTable) throws URISyntaxException { String dbName = hmsTable.getDbName(); String tableName = hmsTable.getTableName(); -return new URI(ICEBERG_URI_PREFIX + dbName + "/" + tableName); +Table table = IcebergTableUtil.getTable(conf, hmsTable); Review Comment: We ran into problems with the approach of loading the Iceberg table here before. The problem is that this method can be called to authorize CREATE TABLE commands as well, at which point the iceberg table does not exist yet, so this will lead to NPE. If the table object is null, then maybe we can use the hmsTable.getSd().location() + "/metadata"? I'm not sure though, just thinking out loud Issue Time Tracking --- Worklog Id: (was: 759246) Time Spent: 20m (was: 10m) > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data via querying table A. 
> {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs
[ https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759239=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759239 ] ASF GitHub Bot logged work on HIVE-26156: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:39 Start Date: 20/Apr/22 13:39 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3225: URL: https://github.com/apache/hive/pull/3225#discussion_r854147948 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ## @@ -374,9 +373,12 @@ public DynamicPartitionCtx createDPContext(HiveConf hiveConf, org.apache.hadoop. fieldOrderMap.put(fields.get(i).name(), i); } +// deletes already use the bucket values in the partition_struct for sorting, so no need to add the sort expression Review Comment: Yes, good catch, I think we can avoid sorting by the other partition columns too. Issue Time Tracking --- Worklog Id: (was: 759239) Time Spent: 40m (was: 0.5h) > Iceberg delete writer should handle deleting from old partition specs > - > > Key: HIVE-26156 > URL: https://issues.apache.org/jira/browse/HIVE-26156 > Project: Hive > Issue Type: Bug >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > While {{HiveIcebergRecordWriter}} always writes data out according to the > latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files > into partitions that correspond to a variety of specs, both old and new. > Therefore we should pass the {{{}table.specs(){}}}map into the > {{HiveIcebergWriter}} so that the delete writer can choose the appropriate > spec on a per-record basis. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location
[ https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524983#comment-17524983 ] tanghui edited comment on HIVE-24920 at 4/20/22 1:38 PM: - After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. in beeline: set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this result is null. select * from part_test11 where dat="20220101"; ||part_test.c1||part_test.c2||part_test.dat|| | | | | - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |*LOCATION*| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal was (Author: sanguines): After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this resulting in no data in the query partition. 
select * from part_test11 where dat="20220101"; - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal > TRANSLATED_TO_EXTERNAL tables may write to the same location > > > Key: HIVE-24920 > URL: https://issues.apache.org/jira/browse/HIVE-24920 > Project: Hive > Issue Type: Bug >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: metastore_translator, pull-request-available > Fix For: 4.0.0, 4.0.0-alpha-1 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > {code} > create table t (a integer); > insert into t values(1); > alter table t rename to t2; > create table t (a integer); -- I expected an exception from this command > (location already exists) but because its an external table no exception > insert into t values(2); > select * from t; -- shows 1 and 2 > drop table t2;-- wipes out data location > select * from t; -- empty resultset > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
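The missing piece the reporter describes is that each partition's SDS location still pointing under the old table directory must be rewritten to the new one on rename. The path computation that fix needs can be sketched as follows; this is an illustration of the required transformation, not the actual metastore patch.

```java
// Sketch: rewrite a partition location that still lives under the old
// (pre-rename) table directory so it points under the renamed table.
public class RenamePartitionLocationSketch {

    static String rewrite(String partitionLocation, String oldTableLocation, String newTableLocation) {
        if (partitionLocation.startsWith(oldTableLocation + "/")) {
            // swap the old table prefix for the new one, keeping the partition suffix
            return newTableLocation + partitionLocation.substring(oldTableLocation.length());
        }
        return partitionLocation;  // already consistent, leave untouched
    }

    public static void main(String[] args) {
        String oldLoc = "hdfs://nameservice1/warehouse/tablespace/external/hive/part_test";
        String newLoc = "hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11";
        System.out.println(rewrite(oldLoc + "/dat=20220101", oldLoc, newLoc));
    }
}
```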
[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs
[ https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759237=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759237 ] ASF GitHub Bot logged work on HIVE-26156: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:37 Start Date: 20/Apr/22 13:37 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3225: URL: https://github.com/apache/hive/pull/3225#discussion_r854146249 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergRecordWriter.java: ## @@ -37,17 +38,17 @@ class HiveIcebergRecordWriter extends HiveIcebergWriter { - HiveIcebergRecordWriter(Schema schema, PartitionSpec spec, FileFormat format, + HiveIcebergRecordWriter(Schema schema, Map specs, FileFormat format, FileWriterFactory fileWriterFactory, OutputFileFactory fileFactory, FileIO io, long targetFileSize, TaskAttemptID taskAttemptID, String tableName) { -super(schema, spec, io, taskAttemptID, tableName, +super(schema, specs, io, taskAttemptID, tableName, new ClusteredDataWriter<>(fileWriterFactory, fileFactory, io, format, targetFileSize)); } @Override public void write(Writable row) throws IOException { Record record = ((Container) row).get(); -writer.write(record, spec, partition(record)); +writer.write(record, specs.get(specs.size() - 1), partition(record)); Review Comment: Good catch, I did not know that Iceberg reused the old spec in step3. 
Will store the latest spec in the record writer then. Issue Time Tracking --- Worklog Id: (was: 759237) Time Spent: 0.5h (was: 20m) > Iceberg delete writer should handle deleting from old partition specs > - > > Key: HIVE-26156 > URL: https://issues.apache.org/jira/browse/HIVE-26156 > Project: Hive > Issue Type: Bug >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > While {{HiveIcebergRecordWriter}} always writes data out according to the > latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files > into partitions that correspond to a variety of specs, both old and new. > Therefore we should pass the {{table.specs()}} map into the > {{HiveIcebergWriter}} so that the delete writer can choose the appropriate > spec on a per-record basis. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table
[ https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tanghui updated HIVE-26158: --- Description: After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. in beeline: set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this result is null. select * from part_test11 where dat="20220101"; ||part_test.c1||part_test.c2||part_test.dat|| | | | | - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |*LOCATION*| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal was: After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. in beeline: set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this result is null. 
select * from part_test11 where dat="20220101"; ||part_test.c1||part_test.c2||part_test.dat|| | | | | - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal > TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after > rename table > -- > > Key: HIVE-26158 > URL: https://issues.apache.org/jira/browse/HIVE-26158 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2 >Reporter: tanghui >Priority: Major > > After the patch is updated, the partition table location and hdfs data > directory are displayed normally, but the partition location of the table in > the SDS in the Hive metabase is still displayed as the location of the old > table, resulting in no data in the query partition. > > in beeline: > > set hive.create.as.external.legacy=true; > CREATE TABLE part_test( > c1 string > ,c2 string > )PARTITIONED BY (dat string) > insert into part_test values ("11","th","20220101") > insert into part_test values ("22","th","20220102") > alter table part_test rename to part_test11; > --this result is null. > select * from part_test11 where dat="20220101"; > ||part_test.c1||part_test.c2||part_test.dat|| > | | | | > - > SDS in the Hive metabase: > select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND > TBLS.TBL_ID=SDS.CD_ID; > --- > |*LOCATION*| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| >
[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs
[ https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759236=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759236 ] ASF GitHub Bot logged work on HIVE-26156: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:37 Start Date: 20/Apr/22 13:37 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3225: URL: https://github.com/apache/hive/pull/3225#discussion_r854145547 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergDeleteWriter.java: ## @@ -53,7 +54,8 @@ public class HiveIcebergDeleteWriter extends HiveIcebergWriter { public void write(Writable row) throws IOException { Record rec = ((Container) row).get(); PositionDelete positionDelete = IcebergAcidUtil.getPositionDelete(rec, rowDataTemplate); -writer.write(positionDelete, spec, partition(positionDelete.row())); +Integer specId = rec.get(0, Integer.class); Review Comment: actually, there's an existing util method to parse out the specid from a record, so I'll use that Issue Time Tracking --- Worklog Id: (was: 759236) Time Spent: 20m (was: 10m) > Iceberg delete writer should handle deleting from old partition specs > - > > Key: HIVE-26156 > URL: https://issues.apache.org/jira/browse/HIVE-26156 > Project: Hive > Issue Type: Bug >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > While {{HiveIcebergRecordWriter}} always writes data out according to the > latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files > into partitions that correspond to a variety of specs, both old and new. > Therefore we should pass the {{{}table.specs(){}}}map into the > {{HiveIcebergWriter}} so that the delete writer can choose the appropriate > spec on a per-record basis. -- This message was sent by Atlassian Jira (v8.20.7#820007)
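The per-record spec selection this ticket proposes can be sketched as follows (types are deliberately simplified; a real `PartitionSpec` map replaces the `String` values): the delete writer reads the spec id carried in each positional-delete record and looks the matching spec up in the `table.specs()` map, instead of always using the latest spec.

```java
import java.util.Map;

// Sketch: choose the partition spec per record, by the spec id the record carries.
public class DeleteSpecSelectionSketch {

    static String specFor(Map<Integer, String> specsById, int recordSpecId) {
        String spec = specsById.get(recordSpecId);
        if (spec == null) {
            throw new IllegalArgumentException("Unknown partition spec id: " + recordSpecId);
        }
        return spec;
    }

    public static void main(String[] args) {
        // A table whose spec evolved: old records still live in spec-0 partitions.
        Map<Integer, String> specs = Map.of(0, "spec-v0 (unpartitioned)", 1, "spec-v1 (bucketed)");
        System.out.println(specFor(specs, 0));  // delete file written against the old spec
    }
}
```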
[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table
[ https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tanghui updated HIVE-26158: --- Description: After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. in beeline: set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this result is null. select * from part_test11 where dat="20220101"; ||part_test.c1||part_test.c2||part_test.dat|| | | | | - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal was: After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. in beeline: set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this resulting in no data in the query partition. 
select * from part_test11 where dat="20220101"; - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal > TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after > rename table > -- > > Key: HIVE-26158 > URL: https://issues.apache.org/jira/browse/HIVE-26158 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2 >Reporter: tanghui >Priority: Major > > After the patch is updated, the partition table location and hdfs data > directory are displayed normally, but the partition location of the table in > the SDS in the Hive metabase is still displayed as the location of the old > table, resulting in no data in the query partition. > > in beeline: > > set hive.create.as.external.legacy=true; > CREATE TABLE part_test( > c1 string > ,c2 string > )PARTITIONED BY (dat string) > insert into part_test values ("11","th","20220101") > insert into part_test values ("22","th","20220102") > alter table part_test rename to part_test11; > --this result is null. > select * from part_test11 where dat="20220101"; > ||part_test.c1||part_test.c2||part_test.dat|| > | | | | > - > SDS in the Hive metabase: > select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND > TBLS.TBL_ID=SDS.CD_ID; > --- > |LOCATION| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| >
[jira] [Work logged] (HIVE-21456) Hive Metastore Thrift over HTTP
[ https://issues.apache.org/jira/browse/HIVE-21456?focusedWorklogId=759233=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759233 ] ASF GitHub Bot logged work on HIVE-21456: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:33 Start Date: 20/Apr/22 13:33 Worklog Time Spent: 10m Work Description: yongzhi merged PR #3105: URL: https://github.com/apache/hive/pull/3105 Issue Time Tracking --- Worklog Id: (was: 759233) Time Spent: 6h 40m (was: 6.5h) > Hive Metastore Thrift over HTTP > --- > > Key: HIVE-21456 > URL: https://issues.apache.org/jira/browse/HIVE-21456 > Project: Hive > Issue Type: New Feature > Components: Metastore, Standalone Metastore >Reporter: Amit Khanna >Assignee: Sourabh Goyal >Priority: Major > Labels: pull-request-available > Attachments: HIVE-21456.2.patch, HIVE-21456.3.patch, > HIVE-21456.4.patch, HIVE-21456.patch > > Time Spent: 6h 40m > Remaining Estimate: 0h > > Hive Metastore currently doesn't have support for HTTP transport because of > which it is not possible to access it via Knox. Adding support for Thrift > over HTTP transport will allow the clients to access via Knox -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table
[ https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tanghui updated HIVE-26158: --- Description: After the patch is applied, the partition table location and the HDFS data directory are displayed correctly, but the partition locations of the table in SDS in the Hive metastore database still point at the old table's location, so queries against the partitions return no data. in beeline: set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string); insert into part_test values ("11","th","20220101"); insert into part_test values ("22","th","20220102"); alter table part_test rename to part_test11; -- this returns no data for the partition. select * from part_test11 where dat="20220101"; - SDS in the Hive metastore database: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to update the partition locations of the table in SDS so that queries return the correct results was: After the patch is applied, the partition table location and the HDFS data directory are displayed correctly, but the partition locations of the table in SDS in the Hive metastore database still point at the old table's location, so queries against the partitions return no data. set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string); insert into part_test values ("11","th","20220101"); insert into part_test values ("22","th","20220102"); alter table part_test rename to part_test11; -- this returns no data for the partition. 
select * from part_test11 where dat="20220101"; - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal > TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after > rename table > -- > > Key: HIVE-26158 > URL: https://issues.apache.org/jira/browse/HIVE-26158 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2 >Reporter: tanghui >Priority: Major > > After the patch is updated, the partition table location and hdfs data > directory are displayed normally, but the partition location of the table in > the SDS in the Hive metabase is still displayed as the location of the old > table, resulting in no data in the query partition. > > in beeline: > > set hive.create.as.external.legacy=true; > CREATE TABLE part_test( > c1 string > ,c2 string > )PARTITIONED BY (dat string) > insert into part_test values ("11","th","20220101") > insert into part_test values ("22","th","20220102") > alter table part_test rename to part_test11; > --this resulting in no data in the query partition. > select * from part_test11 where dat="20220101"; > - > SDS in the Hive metabase: > select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND > TBLS.TBL_ID=SDS.CD_ID; > --- > |LOCATION| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| >
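The fix referenced above (HIVE-26158) amounts to rewriting the stale partition locations in the metastore's SDS table so they fall under the renamed table's directory. A minimal sketch of that path rewrite, treating locations as plain strings with illustrative names — the real fix belongs in the metastore's alter-table handling, not in client code:

```java
public class PartitionLocationFix {
    // Rewrites a partition location that still points under the old table
    // directory so it points under the renamed table's directory instead.
    // Locations that do not start with the old prefix are left untouched.
    static String rewriteLocation(String location, String oldTableDir, String newTableDir) {
        if (location.startsWith(oldTableDir + "/")) {
            return newTableDir + location.substring(oldTableDir.length());
        }
        return location;
    }
}
```

With the locations from the table above, rewriting `.../part_test/dat=20220101` against old dir `.../part_test` and new dir `.../part_test11` yields `.../part_test11/dat=20220101`, matching the directory the rename already moved on HDFS.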
[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table
[ https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tanghui updated HIVE-26158: --- Summary: TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table (was: TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename) > TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after > rename table > -- > > Key: HIVE-26158 > URL: https://issues.apache.org/jira/browse/HIVE-26158 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2 >Reporter: tanghui >Priority: Major > > After the patch is updated, the partition table location and hdfs data > directory are displayed normally, but the partition location of the table in > the SDS in the Hive metabase is still displayed as the location of the old > table, resulting in no data in the query partition. > > set hive.create.as.external.legacy=true; > CREATE TABLE part_test( > c1 string > ,c2 string > )PARTITIONED BY (dat string) > insert into part_test values ("11","th","20220101") > insert into part_test values ("22","th","20220102") > alter table part_test rename to part_test11; > --this resulting in no data in the query partition. > select * from part_test11 where dat="20220101"; > - > SDS in the Hive metabase: > select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND > TBLS.TBL_ID=SDS.CD_ID; > --- > |LOCATION| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| > |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| > --- > > We need to modify the partition location of the table in SDS to ensure that > the query results are normal -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-26157: -- Labels: pull-request-available (was: ) > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data via querying table A. > {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759231&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759231 ] ASF GitHub Bot logged work on HIVE-26157: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:30 Start Date: 20/Apr/22 13:30 Worklog Time Spent: 10m Work Description: lcspinter opened a new pull request, #3226: URL: https://github.com/apache/hive/pull/3226 ### What changes were proposed in this pull request? Change Iceberg storage handler authz URI from `iceberg://dbName/tableName` format to `iceberg://metadataLocation` ### Why are the changes needed? It is possible to set the metadata pointers of table A to point to table B, and therefore you could read table B's data by querying table A. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test, unit test Issue Time Tracking --- Worklog Id: (was: 759231) Remaining Estimate: 0h Time Spent: 10m > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data by querying table A. > {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location
[ https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524983#comment-17524983 ] tanghui edited comment on HIVE-24920 at 4/20/22 1:18 PM: - After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. set hive.create.as.external.legacy=true; CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this resulting in no data in the query partition. select * from part_test11 where dat="20220101"; - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal was (Author: sanguines): After the patch is updated, the partition table location and hdfs data directory are displayed normally, but the partition location of the table in the SDS in the Hive metabase is still displayed as the location of the old table, resulting in no data in the query partition. CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string) insert into part_test values ("11","th","20220101") insert into part_test values ("22","th","20220102") alter table part_test rename to part_test11; --this resulting in no data in the query partition. 
select * from part_test11 where dat="20220101"; - SDS in the Hive metabase: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to modify the partition location of the table in SDS to ensure that the query results are normal > TRANSLATED_TO_EXTERNAL tables may write to the same location > > > Key: HIVE-24920 > URL: https://issues.apache.org/jira/browse/HIVE-24920 > Project: Hive > Issue Type: Bug >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: metastore_translator, pull-request-available > Fix For: 4.0.0, 4.0.0-alpha-1 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > {code} > create table t (a integer); > insert into t values(1); > alter table t rename to t2; > create table t (a integer); -- I expected an exception from this command > (location already exists) but because its an external table no exception > insert into t values(2); > select * from t; -- shows 1 and 2 > drop table t2;-- wipes out data location > select * from t; -- empty resultset > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location
[ https://issues.apache.org/jira/browse/HIVE-26157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] László Pintér reassigned HIVE-26157: > Change Iceberg storage handler authz URI to metadata location > - > > Key: HIVE-26157 > URL: https://issues.apache.org/jira/browse/HIVE-26157 > Project: Hive > Issue Type: Improvement >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > > In HIVE-25964, the authz URI has been changed to "iceberg://db.table". > It is possible to set the metadata pointers of table A to point to table B, > and therefore you could read table B's data via querying table A. > {code:sql} > alter table A set tblproperties > ('metadata_location'='/path/to/B/snapshot.json', > 'previous_metadata_location'='/path/to/B/prev_snapshot.json'); {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs
[ https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759212=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759212 ] ASF GitHub Bot logged work on HIVE-26156: - Author: ASF GitHub Bot Created on: 20/Apr/22 13:13 Start Date: 20/Apr/22 13:13 Worklog Time Spent: 10m Work Description: szlta commented on code in PR #3225: URL: https://github.com/apache/hive/pull/3225#discussion_r854074661 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergDeleteWriter.java: ## @@ -53,7 +54,8 @@ public class HiveIcebergDeleteWriter extends HiveIcebergWriter { public void write(Writable row) throws IOException { Record rec = ((Container) row).get(); PositionDelete positionDelete = IcebergAcidUtil.getPositionDelete(rec, rowDataTemplate); -writer.write(positionDelete, spec, partition(positionDelete.row())); +Integer specId = rec.get(0, Integer.class); Review Comment: I guess this is meta column at position 0? Could you annotate with some comments please? ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java: ## @@ -374,9 +373,12 @@ public DynamicPartitionCtx createDPContext(HiveConf hiveConf, org.apache.hadoop. fieldOrderMap.put(fields.get(i).name(), i); } +// deletes already use the bucket values in the partition_struct for sorting, so no need to add the sort expression Review Comment: Is this true for bucket transform only? Is bucket() special for some reason, or could we avoid sorting according to other partition columns (and transform types) as well? ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergWriter.java: ## @@ -101,7 +101,14 @@ public void close(boolean abort) throws IOException { } protected PartitionKey partition(Record row) { -currentKey.partition(wrapper.wrap(row)); -return currentKey; +// get partition key for the latest spec +return partition(row, specs.size() - 1); Review Comment: Same question for getting the latest spec. 
## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergRecordWriter.java: ## @@ -37,17 +38,17 @@ class HiveIcebergRecordWriter extends HiveIcebergWriter { - HiveIcebergRecordWriter(Schema schema, PartitionSpec spec, FileFormat format, + HiveIcebergRecordWriter(Schema schema, Map specs, FileFormat format, FileWriterFactory fileWriterFactory, OutputFileFactory fileFactory, FileIO io, long targetFileSize, TaskAttemptID taskAttemptID, String tableName) { -super(schema, spec, io, taskAttemptID, tableName, +super(schema, specs, io, taskAttemptID, tableName, new ClusteredDataWriter<>(fileWriterFactory, fileFactory, io, format, targetFileSize)); } @Override public void write(Writable row) throws IOException { Record record = ((Container) row).get(); -writer.write(record, spec, partition(record)); +writer.write(record, specs.get(specs.size() - 1), partition(record)); Review Comment: Are we trying to get the latest spec here? If so it could become problematic if an older spec is reused to be the current one. E.g: partition evolution goes as: 1. initially partitioned by col and col2 -> spec0: identity(col), identity(col2); latest_spec=0 2. remove col from spec -> spec1: identity(col2); latest_spec=1 3. re-add col to spec -> no new spec is created; latest_spec=0 Issue Time Tracking --- Worklog Id: (was: 759212) Remaining Estimate: 0h Time Spent: 10m > Iceberg delete writer should handle deleting from old partition specs > - > > Key: HIVE-26156 > URL: https://issues.apache.org/jira/browse/HIVE-26156 > Project: Hive > Issue Type: Bug >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > While {{HiveIcebergRecordWriter}} always writes data out according to the > latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files > into partitions that correspond to a variety of specs, both old and new. 
> Therefore we should pass the {{{}table.specs(){}}}map into the > {{HiveIcebergWriter}} so that the delete writer can choose the appropriate > spec on a per-record basis. -- This message was sent by Atlassian Jira (v8.20.7#820007)
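The per-record spec selection described above can be sketched with plain collections — this is a simplified model of the idea, not the actual Iceberg API (`PartitionSpec` and the record's spec-id column are stood in for by illustrative types):

```java
import java.util.HashMap;
import java.util.Map;

public class SpecSelection {
    // Simplified stand-in for Iceberg's PartitionSpec (illustrative only).
    static final class Spec {
        final int specId;
        Spec(int specId) { this.specId = specId; }
    }

    // A position-delete record carries the spec id of the data file it targets,
    // so the delete writer looks the spec up per record from the specs map
    // instead of always using the table's latest spec.
    static Spec specFor(Map<Integer, Spec> specs, int recordSpecId) {
        Spec spec = specs.get(recordSpecId);
        if (spec == null) {
            throw new IllegalArgumentException("unknown spec id: " + recordSpecId);
        }
        return spec;
    }

    // Tiny demo: a table that evolved from spec 0 to spec 1; deletes against
    // old data files still resolve to spec 0.
    static int demo(int recordSpecId) {
        Map<Integer, Spec> specs = new HashMap<>();
        specs.put(0, new Spec(0));
        specs.put(1, new Spec(1));
        return specFor(specs, recordSpecId).specId;
    }
}
```

This also shows why hardcoding the "latest" spec (as the review comment on `specs.get(specs.size() - 1)` points out) is fragile: the lookup should be keyed by spec id, not by list position.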
[jira] [Commented] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location
[ https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524983#comment-17524983 ] tanghui commented on HIVE-24920: After the patch is applied, the partition table location and the HDFS data directory are displayed correctly, but the partition locations of the table in SDS in the Hive metastore database still point at the old table's location, so queries against the partitions return no data. CREATE TABLE part_test( c1 string ,c2 string )PARTITIONED BY (dat string); insert into part_test values ("11","th","20220101"); insert into part_test values ("22","th","20220102"); alter table part_test rename to part_test11; -- this returns no data for the partition. select * from part_test11 where dat="20220101"; - SDS in the Hive metastore database: select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND TBLS.TBL_ID=SDS.CD_ID; --- |LOCATION| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101| |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102| --- We need to update the partition locations of the table in SDS so that queries return the correct results > TRANSLATED_TO_EXTERNAL tables may write to the same location > > > Key: HIVE-24920 > URL: https://issues.apache.org/jira/browse/HIVE-24920 > Project: Hive > Issue Type: Bug >Reporter: Zoltan Haindrich >Assignee: Zoltan Haindrich >Priority: Major > Labels: metastore_translator, pull-request-available > Fix For: 4.0.0, 4.0.0-alpha-1 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > {code} > create table t (a integer); > insert into t values(1); > alter table t rename to t2; > create table t (a integer); -- I expected an exception from this command > (location already exists) but because it's an external table no exception > insert into t values(2); > select * from t; -- shows 1 and 2 > drop table t2; -- wipes out data location > 
select * from t; -- empty resultset > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs
[ https://issues.apache.org/jira/browse/HIVE-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-26156: -- Labels: pull-request-available (was: ) > Iceberg delete writer should handle deleting from old partition specs > - > > Key: HIVE-26156 > URL: https://issues.apache.org/jira/browse/HIVE-26156 > Project: Hive > Issue Type: Bug >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > While {{HiveIcebergRecordWriter}} always writes data out according to the > latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files > into partitions that correspond to a variety of specs, both old and new. > Therefore we should pass the {{{}table.specs(){}}}map into the > {{HiveIcebergWriter}} so that the delete writer can choose the appropriate > spec on a per-record basis. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs
[ https://issues.apache.org/jira/browse/HIVE-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Bod reassigned HIVE-26156: - > Iceberg delete writer should handle deleting from old partition specs > - > > Key: HIVE-26156 > URL: https://issues.apache.org/jira/browse/HIVE-26156 > Project: Hive > Issue Type: Bug >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > > While {{HiveIcebergRecordWriter}} always writes data out according to the > latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files > into partitions that correspond to a variety of specs, both old and new. > Therefore we should pass the {{{}table.specs(){}}}map into the > {{HiveIcebergWriter}} so that the delete writer can choose the appropriate > spec on a per-record basis. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759172=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759172 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 12:34 Start Date: 20/Apr/22 12:34 Worklog Time Spent: 10m Work Description: lcspinter commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r854080693 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java: ## @@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) { public static boolean isBucketed(Table table) { return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket[")); } + + /** + * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis. + * If the timestamp provided is before the first snapshot of the table, we return an empty optional. + * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID. + * + * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then: + * - from T0 to T2 -> returns empty + * - from T3 to T5 -> returns S1 + * - from T6 to T8 -> returns S2 + * - from T9 to T∞ -> returns S3 + * + * @param table the table whose snapshot ID we are trying to find + * @param time the timestamp provided in milliseconds + * @return the snapshot ID corresponding to the time + */ + public static Optional findSnapshotForTimestamp(Table table, long time) { +if (table.history().get(0).timestampMillis() > time) { + return Optional.empty(); +} + +for (Snapshot snapshot : table.snapshots()) { Review Comment: Thanks for the explanation! 
Issue Time Tracking --- Worklog Id: (was: 759172) Time Spent: 1.5h (was: 1h 20m) > Support range-based time travel queries for Iceberg > --- > > Key: HIVE-26151 > URL: https://issues.apache.org/jira/browse/HIVE-26151 > Project: Hive > Issue Type: New Feature >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > Allow querying which records have been inserted during a certain time window > for Iceberg tables. The Iceberg TableScan API provides an implementation for > that, so most of the work would go into adding syntax support and > transporting the startTime and endTime parameters to the Iceberg input format. > Proposed new syntax: > SELECT * FROM table FOR SYSTEM_TIME FROM '' TO '' > SELECT * FROM table FOR SYSTEM_VERSION FROM TO > (the TO clause is optional in both cases) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759152=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759152 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 12:04 Start Date: 20/Apr/22 12:04 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r854053748 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java: ## @@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) { public static boolean isBucketed(Table table) { return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket[")); } + + /** + * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis. + * If the timestamp provided is before the first snapshot of the table, we return an empty optional. + * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID. + * + * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then: + * - from T0 to T2 -> returns empty + * - from T3 to T5 -> returns S1 + * - from T6 to T8 -> returns S2 + * - from T9 to T∞ -> returns S3 + * + * @param table the table whose snapshot ID we are trying to find + * @param time the timestamp provided in milliseconds + * @return the snapshot ID corresponding to the time + */ + public static Optional findSnapshotForTimestamp(Table table, long time) { +if (table.history().get(0).timestampMillis() > time) { + return Optional.empty(); +} + +for (Snapshot snapshot : table.snapshots()) { Review Comment: Looks like the snapshots are ordered by commit time. 
Whenever there's a commit, we take the existing list of the snapshots in the `TableMetadata.Builder` [here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L817), and simply append the new snapshot to the end [here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L994). And since it's a List, the iteration order will be deterministic. Issue Time Tracking --- Worklog Id: (was: 759152) Time Spent: 1h 20m (was: 1h 10m) > Support range-based time travel queries for Iceberg > --- > > Key: HIVE-26151 > URL: https://issues.apache.org/jira/browse/HIVE-26151 > Project: Hive > Issue Type: New Feature >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > Allow querying which records have been inserted during a certain time window > for Iceberg tables. The Iceberg TableScan API provides an implementation for > that, so most of the work would go into adding syntax support and > transporting the startTime and endTime parameters to the Iceberg input format. > Proposed new syntax: > SELECT * FROM table FOR SYSTEM_TIME FROM '' TO '' > SELECT * FROM table FOR SYSTEM_VERSION FROM TO > (the TO clause is optional in both cases) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759150=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759150 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 12:03 Start Date: 20/Apr/22 12:03 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r854053748 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java: ## @@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) { public static boolean isBucketed(Table table) { return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket[")); } + + /** + * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis. + * If the timestamp provided is before the first snapshot of the table, we return an empty optional. + * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID. + * + * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then: + * - from T0 to T2 -> returns empty + * - from T3 to T5 -> returns S1 + * - from T6 to T8 -> returns S2 + * - from T9 to T∞ -> returns S3 + * + * @param table the table whose snapshot ID we are trying to find + * @param time the timestamp provided in milliseconds + * @return the snapshot ID corresponding to the time + */ + public static Optional findSnapshotForTimestamp(Table table, long time) { +if (table.history().get(0).timestampMillis() > time) { + return Optional.empty(); +} + +for (Snapshot snapshot : table.snapshots()) { Review Comment: Looks like the snapshots are ordered by commit time. 
Whenever there's a commit, we take the existing list of the snapshots [here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L817), and simply append the new snapshot to the end [here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L994). And since it's a List, the iteration order will be deterministic. Issue Time Tracking --- Worklog Id: (was: 759150) Time Spent: 1h 10m (was: 1h) > Support range-based time travel queries for Iceberg > --- > > Key: HIVE-26151 > URL: https://issues.apache.org/jira/browse/HIVE-26151 > Project: Hive > Issue Type: New Feature >Reporter: Marton Bod >Assignee: Marton Bod >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Allow querying which records have been inserted during a certain time window > for Iceberg tables. The Iceberg TableScan API provides an implementation for > that, so most of the work would go into adding syntax support and > transporting the startTime and endTime parameters to the Iceberg input format. > Proposed new syntax: > SELECT * FROM table FOR SYSTEM_TIME FROM '' TO '' > SELECT * FROM table FOR SYSTEM_VERSION FROM TO > (the TO clause is optional in both cases) -- This message was sent by Atlassian Jira (v8.20.7#820007)
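The lookup behaviour documented in the javadoc under review can be modelled with plain collections. The sketch below assumes, per the thread's conclusion, that snapshots come ordered by commit time; the types are simplified stand-ins, not the Iceberg API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class SnapshotLookup {
    // Simplified stand-in for an Iceberg snapshot: id plus commit timestamp.
    static final class Snapshot {
        final long snapshotId;
        final long timestampMillis;
        Snapshot(long snapshotId, long timestampMillis) {
            this.snapshotId = snapshotId;
            this.timestampMillis = timestampMillis;
        }
    }

    // Returns the id of the snapshot committed at or immediately before `time`,
    // or empty if `time` is earlier than the first commit. The list is assumed
    // ordered by commit time, which the review thread concludes holds for
    // the table's snapshot list.
    static Optional<Long> findSnapshotForTimestamp(List<Snapshot> snapshots, long time) {
        Optional<Long> result = Optional.empty();
        for (Snapshot s : snapshots) {
            if (s.timestampMillis > time) {
                break;
            }
            result = Optional.of(s.snapshotId);
        }
        return result;
    }

    // The example from the javadoc: S1, S2, S3 committed at T3, T6, T9.
    static String demo(long time) {
        List<Snapshot> snaps = new ArrayList<>();
        snaps.add(new Snapshot(1, 3));
        snaps.add(new Snapshot(2, 6));
        snaps.add(new Snapshot(3, 9));
        return findSnapshotForTimestamp(snaps, time).map(String::valueOf).orElse("empty");
    }
}
```

Walking the javadoc's example through this sketch: T0–T2 yields empty, T3–T5 yields S1, T6–T8 yields S2, and anything from T9 on yields S3.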
[jira] [Work logged] (HIVE-26009) Determine number of buckets for implicitly bucketed ACIDv2 tables
[ https://issues.apache.org/jira/browse/HIVE-26009?focusedWorklogId=759108&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759108 ] ASF GitHub Bot logged work on HIVE-26009: - Author: ASF GitHub Bot Created on: 20/Apr/22 10:38 Start Date: 20/Apr/22 10:38 Worklog Time Spent: 10m Work Description: simhadri-g opened a new pull request, #3224: URL: https://github.com/apache/hive/pull/3224 ### What changes were proposed in this pull request? The change prevents the reducer writing to ORC files from running with parallelism 1 when the tables are bucketed. [HIVE-26009](https://issues.apache.org/jira/browse/HIVE-26009) and [HIVE-25611](https://issues.apache.org/jira/browse/HIVE-25611) ### Why are the changes needed? The numberOfBuckets for implicitly bucketed tables is set to -1 by default. When this is the case, it is left to Hive to estimate the number of reducers required, and this estimate is not optimal in all cases. Also, picking a large value for hive.exec.reducers.bytes.per.reducer before running the MERGE query forces the reducer writing to ORC files to run with parallelism 1, simulating a scenario where the table has many buckets but the choice of parallelism does not take them into account. This can lead to a significant performance bottleneck. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Qtests and manual tests. 
Issue Time Tracking --- Worklog Id: (was: 759108) Remaining Estimate: 0h Time Spent: 10m > Determine number of buckets for implicitly bucketed ACIDv2 tables > -- > > Key: HIVE-26009 > URL: https://issues.apache.org/jira/browse/HIVE-26009 > Project: Hive > Issue Type: Improvement >Reporter: Simhadri G >Assignee: Simhadri G >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Hive tries to set the number of reducers equal to the number of buckets here: > [https://github.com/apache/hive/blob/9857c4e584384f7b0a49c34bc2bdf876c2ea1503/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L6958] > > The numberOfBuckets for implicitly bucketed tables is set to -1 by default. > When this is the case, it is left to Hive to estimate the number of reducers > required for the job, based on the job input and configuration parameters. > [https://github.com/apache/hive/blob/9857c4e584384f7b0a49c34bc2bdf876c2ea1503/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L3369] > > This estimate is not optimal in all cases. In the worst case, it can result > in a single reducer being launched, which can lead to a significant > performance bottleneck. > > Ideally, the number of reducers launched should equal the number of buckets, > which is the case for explicitly bucketed tables. -- This message was sent by Atlassian Jira (v8.20.7#820007)
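The fallback estimate described in the PR can be sketched as below. The formula (ceiling of input bytes over hive.exec.reducers.bytes.per.reducer, clamped to a configured maximum) and the parameter names are a simplified assumption for illustration, not Hive's exact code:

```java
public class ReducerEstimate {
    // Hedged sketch of the fallback used when numberOfBuckets == -1:
    // roughly ceil(inputBytes / bytesPerReducer), clamped to [1, maxReducers].
    static int estimateReducers(long inputBytes, long bytesPerReducer, int maxReducers) {
        long estimated = (inputBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
        return (int) Math.max(1, Math.min(maxReducers, estimated));
    }

    // Preferring the bucket count when the table is explicitly bucketed avoids the
    // single-reducer bottleneck described in the issue: a huge bytesPerReducer can
    // otherwise collapse the estimate to 1 regardless of how many buckets exist.
    static int chooseReducers(int numberOfBuckets, long inputBytes, long bytesPerReducer, int maxReducers) {
        return numberOfBuckets > 0
            ? numberOfBuckets
            : estimateReducers(inputBytes, bytesPerReducer, maxReducers);
    }

    public static void main(String[] args) {
        // Large bytesPerReducer with an implicitly bucketed table (-1) -> 1 reducer.
        System.out.println(chooseReducers(-1, 10L, 1_000_000_000L, 100)); // prints 1
    }
}
```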
[jira] [Updated] (HIVE-26009) Determine number of buckets for implicitly bucketed ACIDv2 tables
[ https://issues.apache.org/jira/browse/HIVE-26009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-26009: -- Labels: pull-request-available (was: )
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759068=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759068 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 09:29 Start Date: 20/Apr/22 09:29 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r853926634 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java: ## @@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) { public static boolean isBucketed(Table table) { return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket[")); } + + /** + * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis. + * If the timestamp provided is before the first snapshot of the table, we return an empty optional. + * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID. + * + * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then: + * - from T0 to T2 -> returns empty + * - from T3 to T5 -> returns S1 + * - from T6 to T8 -> returns S2 + * - from T9 to T∞ -> returns S3 + * + * @param table the table whose snapshot ID we are trying to find + * @param time the timestamp provided in milliseconds + * @return the snapshot ID corresponding to the time + */ + public static Optional findSnapshotForTimestamp(Table table, long time) { +if (table.history().get(0).timestampMillis() > time) { + return Optional.empty(); +} + +for (Snapshot snapshot : table.snapshots()) { Review Comment: Actually this is only true for V2 tables. 
Let me debug into it a bit to see what's happening for V1 Issue Time Tracking --- Worklog Id: (was: 759068) Time Spent: 1h (was: 50m)
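The lookup semantics documented for `findSnapshotForTimestamp` (empty before the first commit, otherwise the latest snapshot committed at or before the timestamp) can be sketched with parallel arrays standing in for the table history. This is an illustrative reimplementation, not the Iceberg/Hive code:

```java
import java.util.OptionalLong;

public class SnapshotLookup {
    // commitTimes[i] is the commit time of the snapshot with ID ids[i]; the arrays
    // are in commit order, mirroring the List returned by table.snapshots().
    // Returns the latest snapshot ID committed at or before `time`, or empty if
    // `time` precedes the first commit.
    static OptionalLong findSnapshotForTimestamp(long[] commitTimes, long[] ids, long time) {
        if (commitTimes.length == 0 || commitTimes[0] > time) {
            return OptionalLong.empty();
        }
        long result = ids[0];
        for (int i = 0; i < commitTimes.length; i++) {
            if (commitTimes[i] <= time) {
                result = ids[i]; // commit order: keep the latest match
            }
        }
        return OptionalLong.of(result);
    }

    public static void main(String[] args) {
        // Snapshots S1, S2, S3 committed at T3, T6, T9, as in the Javadoc example.
        long[] times = {3, 6, 9};
        long[] ids = {1, 2, 3};
        System.out.println(findSnapshotForTimestamp(times, ids, 5)); // prints OptionalLong[1]
    }
}
```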
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759067=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759067 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 09:27 Start Date: 20/Apr/22 09:27 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r853924951 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java: ## @@ -207,6 +218,39 @@ public RecordReader createRecordReader(InputSplit split, TaskAttemptCon return new IcebergRecordReader<>(); } + private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) { +// let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use +// the first snapshot of the table +long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime) +.orElseGet(() -> table.history().get(0).snapshotId()); +if (fromSnapshot == table.currentSnapshot().snapshotId()) { + throw new IllegalArgumentException( + "Provided FROM timestamp must be earlier than the latest snapshot of the table."); +} +long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1); +if (toTime != -1) { + if (fromTime >= toTime) { +throw new IllegalArgumentException( +"Provided FROM timestamp must precede the provided TO timestamp."); + } + long toSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, toTime) + .orElseThrow(() -> new IllegalArgumentException( + "Provided TO timestamp must be after the first snapshot of the table.")); + return scan.appendsBetween(fromSnapshot, toSnapshot); +} else { + return scan.appendsAfter(fromSnapshot); +} + } + + private static TableScan scanWithVersionRange(Configuration conf, TableScan scan, long fromSnapshot) { +long toSnapshot = conf.getLong(InputFormatConfig.TO_VERSION, -1); Review Comment: Sure Issue Time 
Tracking --- Worklog Id: (was: 759067) Time Spent: 50m (was: 40m)
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759066=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759066 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 09:26 Start Date: 20/Apr/22 09:26 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r853924488 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java: ## @@ -207,6 +218,39 @@ public RecordReader createRecordReader(InputSplit split, TaskAttemptCon return new IcebergRecordReader<>(); } + private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) { +// let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use +// the first snapshot of the table +long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime) +.orElseGet(() -> table.history().get(0).snapshotId()); +if (fromSnapshot == table.currentSnapshot().snapshotId()) { + throw new IllegalArgumentException( + "Provided FROM timestamp must be earlier than the latest snapshot of the table."); +} +long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1); Review Comment: Sure ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java: ## @@ -207,6 +218,39 @@ public RecordReader createRecordReader(InputSplit split, TaskAttemptCon return new IcebergRecordReader<>(); } + private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) { +// let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use +// the first snapshot of the table +long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime) +.orElseGet(() -> table.history().get(0).snapshotId()); +if (fromSnapshot == 
table.currentSnapshot().snapshotId()) { + throw new IllegalArgumentException( + "Provided FROM timestamp must be earlier than the latest snapshot of the table."); +} +long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1); +if (toTime != -1) { + if (fromTime >= toTime) { Review Comment: Yep, makes sense Issue Time Tracking --- Worklog Id: (was: 759066) Time Spent: 40m (was: 0.5h)
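The argument checks in `scanWithTimeRange` reviewed above reduce to two guards; a standalone sketch using plain longs instead of Iceberg types, with -1 meaning "TO not set" as in the `conf.getLong(..., -1)` default:

```java
public class TimeRangeValidation {
    // Simplified mirror of the validation logic: the FROM timestamp must resolve to
    // something earlier than the current snapshot, and when a TO timestamp is set
    // it must come strictly after FROM. Error messages copied from the patch.
    static void validate(long fromTime, long toTime, long fromSnapshotId, long currentSnapshotId) {
        if (fromSnapshotId == currentSnapshotId) {
            throw new IllegalArgumentException(
                "Provided FROM timestamp must be earlier than the latest snapshot of the table.");
        }
        if (toTime != -1 && fromTime >= toTime) {
            throw new IllegalArgumentException(
                "Provided FROM timestamp must precede the provided TO timestamp.");
        }
    }

    public static void main(String[] args) {
        validate(5, 10, 1, 3); // valid range: no exception
        validate(5, -1, 1, 3); // open-ended range (no TO): also valid
        System.out.println("both ranges accepted");
    }
}
```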
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759063=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759063 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 09:21 Start Date: 20/Apr/22 09:21 Worklog Time Spent: 10m Work Description: marton-bod commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r853918915 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java: ## @@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) { public static boolean isBucketed(Table table) { return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket[")); } + + /** + * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis. + * If the timestamp provided is before the first snapshot of the table, we return an empty optional. + * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID. + * + * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then: + * - from T0 to T2 -> returns empty + * - from T3 to T5 -> returns S1 + * - from T6 to T8 -> returns S2 + * - from T9 to T∞ -> returns S3 + * + * @param table the table whose snapshot ID we are trying to find + * @param time the timestamp provided in milliseconds + * @return the snapshot ID corresponding to the time + */ + public static Optional findSnapshotForTimestamp(Table table, long time) { +if (table.history().get(0).timestampMillis() > time) { + return Optional.empty(); +} + +for (Snapshot snapshot : table.snapshots()) { Review Comment: That's a good question. The snapshots come from `TableMetadata.snapshots()` which returns a `List`. 
The snapshots seem to be sorted by sequence number, which means it's also sorted by snapshot time millis: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L982-L990 Issue Time Tracking --- Worklog Id: (was: 759063) Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg
[ https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759056=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759056 ] ASF GitHub Bot logged work on HIVE-26151: - Author: ASF GitHub Bot Created on: 20/Apr/22 09:00 Start Date: 20/Apr/22 09:00 Worklog Time Spent: 10m Work Description: lcspinter commented on code in PR #3222: URL: https://github.com/apache/hive/pull/3222#discussion_r853859878 ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java: ## @@ -207,6 +218,39 @@ public RecordReader createRecordReader(InputSplit split, TaskAttemptCon return new IcebergRecordReader<>(); } + private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) { +// let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use +// the first snapshot of the table +long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime) +.orElseGet(() -> table.history().get(0).snapshotId()); +if (fromSnapshot == table.currentSnapshot().snapshotId()) { + throw new IllegalArgumentException( + "Provided FROM timestamp must be earlier than the latest snapshot of the table."); +} +long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1); +if (toTime != -1) { + if (fromTime >= toTime) { Review Comment: I think we can move this check to the beginning of the method, to spare some execution time. ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java: ## @@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) { public static boolean isBucketed(Table table) { return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket[")); } + + /** + * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis. 
+ * If the timestamp provided is before the first snapshot of the table, we return an empty optional. + * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID. + * + * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then: + * - from T0 to T2 -> returns empty + * - from T3 to T5 -> returns S1 + * - from T6 to T8 -> returns S2 + * - from T9 to T∞ -> returns S3 + * + * @param table the table whose snapshot ID we are trying to find + * @param time the timestamp provided in milliseconds + * @return the snapshot ID corresponding to the time + */ + public static Optional findSnapshotForTimestamp(Table table, long time) { +if (table.history().get(0).timestampMillis() > time) { + return Optional.empty(); +} + +for (Snapshot snapshot : table.snapshots()) { Review Comment: Are we certain that the table.snapshots() returns a list sorted by snapshot time? ## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java: ## @@ -207,6 +218,39 @@ public RecordReader createRecordReader(InputSplit split, TaskAttemptCon return new IcebergRecordReader<>(); } + private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) { +// let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use +// the first snapshot of the table +long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime) +.orElseGet(() -> table.history().get(0).snapshotId()); +if (fromSnapshot == table.currentSnapshot().snapshotId()) { + throw new IllegalArgumentException( + "Provided FROM timestamp must be earlier than the latest snapshot of the table."); +} +long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1); Review Comment: nit: Can we move the toTime to the method param? 
## iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java: ## @@ -207,6 +218,39 @@ public RecordReader createRecordReader(InputSplit split, TaskAttemptCon return new IcebergRecordReader<>(); } + private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) { +// let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use +// the first snapshot of the table +long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime) +.orElseGet(() -> table.history().get(0).snapshotId()); +if (fromSnapshot == table.currentSnapshot().snapshotId()) { + throw new IllegalArgumentException( +
[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar
[ https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759044=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759044 ] ASF GitHub Bot logged work on HIVE-26074: - Author: ASF GitHub Bot Created on: 20/Apr/22 08:31 Start Date: 20/Apr/22 08:31 Worklog Time Spent: 10m Work Description: ayushtkn commented on code in PR #3187: URL: https://github.com/apache/hive/pull/3187#discussion_r853870852 ## ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java: ## @@ -768,6 +774,9 @@ public static SingleValueBoundaryScanner getBoundaryScanner(BoundaryDef start, B case "string": return new StringPrimitiveValueBoundaryScanner(start, end, exprDef, nullsLast); default: + if (typeString.startsWith("char") || typeString.startsWith("varchar")) { Review Comment: Done. As discussed I pulled all of them together to avoid multiple branches. Issue Time Tracking --- Worklog Id: (was: 759044) Time Spent: 1h 20m (was: 1h 10m) > PTF Vectorization: BoundaryScanner for varchar > -- > > Key: HIVE-26074 > URL: https://issues.apache.org/jira/browse/HIVE-26074 > Project: Hive > Issue Type: Bug >Reporter: László Bodor >Assignee: László Bodor >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > HIVE-24761 should be extended for varchar, otherwise it fails on varchar type > {code} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: > attempt to setup a Window for typeString: 'varchar(170)' > at > org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773) > at > org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner. 
(ValueBoundaryScanner.java:1257) > at > org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237) > at > org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327) > at > org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40) > at > org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442) > at > org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631) > at > org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782) > at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731) > at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755) > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383) > ... 16 more > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
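The fix dispatches on typeString values such as "varchar(170)", which carry a length and therefore never match a plain switch case. A hedged sketch of the normalization idea; the helper name is illustrative (Hive itself uses startsWith checks on the type string rather than this helper):

```java
public class TypeStringNormalize {
    // Parameterized type names like "varchar(170)", "char(1)" or "decimal(10,4)"
    // embed length/precision, so dispatching on the raw string misses them.
    // Stripping the parenthesized suffix first lets one case cover every instance.
    static String baseTypeName(String typeString) {
        int paren = typeString.indexOf('(');
        return paren < 0 ? typeString : typeString.substring(0, paren);
    }

    public static void main(String[] args) {
        // The failing case from the stack trace: 'varchar(170)'.
        System.out.println(baseTypeName("varchar(170)")); // prints varchar
    }
}
```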
[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar
[ https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759040=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759040 ] ASF GitHub Bot logged work on HIVE-26074: - Author: ASF GitHub Bot Created on: 20/Apr/22 08:30 Start Date: 20/Apr/22 08:30 Worklog Time Spent: 10m Work Description: ayushtkn commented on code in PR #3187: URL: https://github.com/apache/hive/pull/3187#discussion_r853869839 ## ql/src/test/queries/clientpositive/vector_ptf_bounded_start.q: ## @@ -3,24 +3,31 @@ set hive.vectorized.execution.enabled=true; set hive.vectorized.execution.ptf.enabled=true; set hive.fetch.task.conversion=none; -CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double, rowindex int) +CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double, +p_type char(1), p_varchar varchar(5), rowindex int) ROW FORMAT DELIMITED -FIELDS TERMINATED BY '\t' +FIELDS TERMINATED BY ',' STORED AS TEXTFILE; LOAD DATA LOCAL INPATH '../../data/files/vector_ptf_part_simple_all_datatypes.txt' OVERWRITE INTO TABLE vector_ptf_part_simple_text; +SELECT * from vector_ptf_part_simple_text; + CREATE TABLE vector_ptf_part_simple_orc (p_mfgr string, p_name string, p_date date, p_timestamp timestamp, -p_int int, p_retailprice double, p_decimal decimal(10,4), rowindex int) stored as orc; +p_int int, p_retailprice double, p_decimal decimal(10,4), p_type char(1), p_varchar varchar(5),rowindex int) stored Review Comment: Changed. 
Issue Time Tracking --- Worklog Id: (was: 759040) Time Spent: 1h (was: 50m)
[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar
[ https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759042=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759042 ] ASF GitHub Bot logged work on HIVE-26074: - Author: ASF GitHub Bot Created on: 20/Apr/22 08:30 Start Date: 20/Apr/22 08:30 Worklog Time Spent: 10m Work Description: ayushtkn commented on code in PR #3187: URL: https://github.com/apache/hive/pull/3187#discussion_r853870053 ## ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java: ## @@ -1214,6 +1223,55 @@ public boolean isEqualPrimitive(String s1, String s2) { } } +class CharValueBoundaryScanner extends SingleValueBoundaryScanner { + public CharValueBoundaryScanner(BoundaryDef start, BoundaryDef end, + OrderExpressionDef expressionDef, boolean nullsLast) { +super(start, end, expressionDef, nullsLast); + } + + @Override + public boolean isDistanceGreater(Object v1, Object v2, int amt) { +HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1, +(PrimitiveObjectInspector) expressionDef.getOI()); +HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2, +(PrimitiveObjectInspector) expressionDef.getOI()); +return s1 != null && s2 != null && s1.compareTo(s2) > 0; + } + + @Override + public boolean isEqual(Object v1, Object v2) { +HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1, +(PrimitiveObjectInspector) expressionDef.getOI()); +HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2, +(PrimitiveObjectInspector) expressionDef.getOI()); +return (s1 == null && s2 == null) || (s1 != null && s1.equals(s2)); + } +} + +class VarcharValueBoundaryScanner extends SingleValueBoundaryScanner { + public VarcharValueBoundaryScanner(BoundaryDef start, BoundaryDef end, + OrderExpressionDef expressionDef, boolean nullsLast) { +super(start, end, expressionDef, nullsLast); + } + + @Override + public boolean isDistanceGreater(Object v1, Object v2, int amt) { +HiveVarchar s1 = PrimitiveObjectInspectorUtils.getHiveVarchar(v1, 
+(PrimitiveObjectInspector) expressionDef.getOI()); +HiveVarchar s2 = PrimitiveObjectInspectorUtils.getHiveVarchar(v2, +(PrimitiveObjectInspector) expressionDef.getOI()); +return s1 != null && s2 != null && s1.compareTo(s2) > 0; + } + + @Override + public boolean isEqual(Object v1, Object v2) { Review Comment: Added Issue Time Tracking --- Worklog Id: (was: 759042) Time Spent: 1h 10m (was: 1h)
[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar
[ https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759009=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759009 ]

ASF GitHub Bot logged work on HIVE-26074:
-----------------------------------------
                Author: ASF GitHub Bot
            Created on: 20/Apr/22 07:35
            Start Date: 20/Apr/22 07:35
    Worklog Time Spent: 10m

Work Description: abstractdog commented on code in PR #3187:
URL: https://github.com/apache/hive/pull/3187#discussion_r845932714

ql/src/test/queries/clientpositive/vector_ptf_bounded_start.q:
{code}
@@ -3,24 +3,31 @@ set hive.vectorized.execution.enabled=true;
 set hive.vectorized.execution.ptf.enabled=true;
 set hive.fetch.task.conversion=none;

-CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double, rowindex int)
+CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double,
+p_type char(1), p_varchar varchar(5), rowindex int)
 ROW FORMAT DELIMITED
-FIELDS TERMINATED BY '\t'
+FIELDS TERMINATED BY ','
 STORED AS TEXTFILE;

 LOAD DATA LOCAL INPATH '../../data/files/vector_ptf_part_simple_all_datatypes.txt' OVERWRITE INTO TABLE vector_ptf_part_simple_text;

+SELECT * from vector_ptf_part_simple_text;
+
 CREATE TABLE vector_ptf_part_simple_orc (p_mfgr string, p_name string, p_date date, p_timestamp timestamp,
-p_int int, p_retailprice double, p_decimal decimal(10,4), rowindex int) stored as orc;
+p_int int, p_retailprice double, p_decimal decimal(10,4), p_type char(1), p_varchar varchar(5),rowindex int) stored
{code}
Review Comment: let this be p_char instead of p_type

ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java:
{code:java}
@@ -1214,6 +1223,55 @@ public boolean isEqualPrimitive(String s1, String s2) {
   }
 }

+class CharValueBoundaryScanner extends SingleValueBoundaryScanner {
+  public CharValueBoundaryScanner(BoundaryDef start, BoundaryDef end,
+      OrderExpressionDef expressionDef, boolean nullsLast) {
+    super(start, end, expressionDef, nullsLast);
+  }
+
+  @Override
+  public boolean isDistanceGreater(Object v1, Object v2, int amt) {
+    HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return s1 != null && s2 != null && s1.compareTo(s2) > 0;
+  }
+
+  @Override
+  public boolean isEqual(Object v1, Object v2) {
+    HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return (s1 == null && s2 == null) || (s1 != null && s1.equals(s2));
+  }
+}
+
+class VarcharValueBoundaryScanner extends SingleValueBoundaryScanner {
+  public VarcharValueBoundaryScanner(BoundaryDef start, BoundaryDef end,
+      OrderExpressionDef expressionDef, boolean nullsLast) {
+    super(start, end, expressionDef, nullsLast);
+  }
+
+  @Override
+  public boolean isDistanceGreater(Object v1, Object v2, int amt) {
+    HiveVarchar s1 = PrimitiveObjectInspectorUtils.getHiveVarchar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveVarchar s2 = PrimitiveObjectInspectorUtils.getHiveVarchar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return s1 != null && s2 != null && s1.compareTo(s2) > 0;
+  }
+
+  @Override
+  public boolean isEqual(Object v1, Object v2) {
{code}
Review Comment: can you please add an isEqual testcase to TestValueBoundaryScanner?
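As a note on the review request above: the null-handling contract of the new scanners is easy to capture in a standalone check. The sketch below is a minimal, simplified version of that contract, with plain Strings standing in for HiveChar/HiveVarchar and the ObjectInspector plumbing omitted; the class and method names are illustrative, not the actual TestValueBoundaryScanner API.

```java
// Simplified sketch of the Char/Varchar boundary-scanner semantics:
// plain Strings stand in for HiveChar/HiveVarchar values.
public class BoundaryScannerSketch {

    // Mirrors isDistanceGreater: true only when both sides are non-null
    // and v1 sorts strictly after v2.
    static boolean isDistanceGreater(String s1, String s2) {
        return s1 != null && s2 != null && s1.compareTo(s2) > 0;
    }

    // Mirrors isEqual: two nulls compare equal, a single null does not
    // equal any non-null value.
    static boolean isEqual(String s1, String s2) {
        return (s1 == null && s2 == null) || (s1 != null && s1.equals(s2));
    }

    public static void main(String[] args) {
        if (!isEqual(null, null)) throw new AssertionError();
        if (isEqual("a", null)) throw new AssertionError();
        if (isEqual(null, "a")) throw new AssertionError();
        if (!isEqual("a", "a")) throw new AssertionError();
        if (isDistanceGreater(null, "a")) throw new AssertionError();
        if (!isDistanceGreater("b", "a")) throw new AssertionError();
        if (isDistanceGreater("a", "b")) throw new AssertionError();
        System.out.println("ok");
    }
}
```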
ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java:
{code:java}
@@ -768,6 +774,9 @@ public static SingleValueBoundaryScanner getBoundaryScanner(BoundaryDef start, B
       case "string":
         return new StringPrimitiveValueBoundaryScanner(start, end, exprDef, nullsLast);
       default:
+        if (typeString.startsWith("char") || typeString.startsWith("varchar")) {
{code}
Review Comment: the same is handled for decimal above:
{code:java}
if (typeString.startsWith("decimal")){
  typeString = "decimal"; //DecimalTypeInfo.getTypeName() includes scale/precision: "decimal(10,4)"
}
{code}

Issue Time Tracking
-------------------

    Worklog Id:     (was: 759009)
    Time Spent: 50m  (was: 40m)

> PTF Vectorization: BoundaryScanner for varchar
> ----------------------------------------------
>
>                 Key: HIVE-26074
>                 URL: https://issues.apache.org/jira/browse/HIVE-26074
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>              Labels: pull-request-available
>          Time
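The reviewer's point above is that parameterized Hive type names carry their parameters ("char(1)", "varchar(5)", "decimal(10,4)"), so dispatching on the base type name needs a prefix check before the switch. A minimal standalone sketch of that normalization step follows; the class and method names are hypothetical, only the prefix-matching idea comes from the review.

```java
// Sketch of normalizing parameterized type names to their base name,
// mirroring the startsWith checks discussed in the review.
public class TypeNameNormalizer {

    // Strip type parameters such as "(10,4)" so "decimal(10,4)",
    // "char(1)" and "varchar(5)" all dispatch on the base type name.
    static String normalize(String typeString) {
        if (typeString.startsWith("decimal")) {
            return "decimal";
        }
        if (typeString.startsWith("varchar")) {
            return "varchar";
        }
        if (typeString.startsWith("char")) {
            return "char";
        }
        return typeString; // unparameterized names pass through unchanged
    }

    public static void main(String[] args) {
        if (!normalize("decimal(10,4)").equals("decimal")) throw new AssertionError();
        if (!normalize("varchar(5)").equals("varchar")) throw new AssertionError();
        if (!normalize("char(1)").equals("char")) throw new AssertionError();
        if (!normalize("string").equals("string")) throw new AssertionError();
        System.out.println("ok");
    }
}
```

Note that "varchar(5)" never matches the "char" prefix (it starts with 'v'), so the check order here is not load-bearing; the decimal case is exactly the pattern the reviewer quotes from the existing code.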
[jira] [Updated] (HIVE-26146) Handle missing hive.acid.key.index in the fixacidkeyindex utility
[ https://issues.apache.org/jira/browse/HIVE-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alessandro Solimando updated HIVE-26146:
----------------------------------------
    Description: 
There is a utility in Hive which can validate/fix a corrupted _hive.acid.key.index_:
{code:bash}
hive --service fixacidkeyindex $orcfilepath
{code}
At the moment the utility throws an NPE if the _hive.acid.key.index_ metadata entry is missing:
{noformat}
ERROR checking /hive-dev-box/multistripe_ko_acid.orc
java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.validate(FixAcidKeyIndex.java:183)
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFile(FixAcidKeyIndex.java:147)
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFiles(FixAcidKeyIndex.java:130)
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.main(FixAcidKeyIndex.java:106)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
{noformat}
The aim of this ticket is to handle this case so that the metadata entry can be re-generated even when it is missing.
was:
There is a utility in hive which can validate/fix corrupted _hive.acid.key.index_:
{code:bash}
hive --service fixacidkeyindex
{code}
At the moment the utility throws a NPE if the _hive.acid.key.index_ metadata entry is missing:
{noformat}
ERROR checking /hive-dev-box/multistripe_ko_acid.orc
java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.validate(FixAcidKeyIndex.java:183)
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFile(FixAcidKeyIndex.java:147)
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFiles(FixAcidKeyIndex.java:130)
	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.main(FixAcidKeyIndex.java:106)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
{noformat}
The aim of this ticket is to handle such case in order to support re-generating this metadata entry even when it is missing.
> Handle missing hive.acid.key.index in the fixacidkeyindex utility
> -----------------------------------------------------------------
>
>                 Key: HIVE-26146
>                 URL: https://issues.apache.org/jira/browse/HIVE-26146
>             Project: Hive
>          Issue Type: Improvement
>          Components: ORC, Transactions
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 4.0.0-alpha-2
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is a utility in hive which can validate/fix corrupted
> _hive.acid.key.index_:
> {code:bash}
> hive --service fixacidkeyindex $orcfilepath
> {code}
> At the moment the utility throws a NPE if the _hive.acid.key.index_ metadata
> entry is missing:
> {noformat}
> ERROR checking /hive-dev-box/multistripe_ko_acid.orc
> java.lang.NullPointerException
> 	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.validate(FixAcidKeyIndex.java:183)
> 	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFile(FixAcidKeyIndex.java:147)
> 	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFiles(FixAcidKeyIndex.java:130)
> 	at org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.main(FixAcidKeyIndex.java:106)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
> {noformat}
> The aim of this ticket is to handle such case in order to support
> re-generating this metadata entry even when it is missing.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
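The fix the ticket asks for is essentially a defensive-lookup pattern: a missing metadata entry should signal "regenerate the index" rather than be dereferenced into an NPE. The sketch below illustrates that pattern in isolation; a plain Map stands in for the ORC file's user metadata, and the class and method names are illustrative, not the actual FixAcidKeyIndex API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of null-safe validation for an optional metadata entry.
public class AcidKeyIndexCheck {
    static final String ACID_KEY_INDEX = "hive.acid.key.index";

    // Returns true when the entry is present and non-empty; a false
    // result tells the caller to regenerate the index instead of
    // failing with a NullPointerException.
    static boolean isValid(Map<String, String> fileMetadata) {
        String index = fileMetadata.get(ACID_KEY_INDEX);
        return index != null && !index.isEmpty();
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        // Missing entry: flagged for regeneration, no NPE thrown.
        if (isValid(meta)) throw new AssertionError();
        meta.put(ACID_KEY_INDEX, "1,0,5;");
        if (!isValid(meta)) throw new AssertionError();
        System.out.println("ok");
    }
}
```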