[jira] [Comment Edited] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location

2022-04-20 Thread tanghui (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524983#comment-17524983
 ] 

tanghui edited comment on HIVE-24920 at 4/21/22 12:41 AM:
--

After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations of the table in 
the SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
) PARTITIONED BY (dat string);

insert into part_test values ("11","th","20220101");
insert into part_test values ("22","th","20220102");

alter table part_test rename to part_test11;

-- this query returns no rows.
select * from part_test11 where dat="20220101";
||part_test.c1||part_test.c2||part_test.dat||
| | | |

-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|*LOCATION*|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to update the partition locations of the table in SDS so that queries 
return the correct results:

https://issues.apache.org/jira/browse/HIVE-26158
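
For illustration, a minimal repair sketch against a MySQL-backed metastore (the 
literal paths come from the SDS listing above; back up the metastore and verify 
the affected rows with a SELECT before attempting anything like this):

{code:sql}
-- Hedged sketch only: rewrite the stale partition locations that still point
-- under the old table directory.
UPDATE SDS
SET LOCATION = REPLACE(LOCATION,
    '/warehouse/tablespace/external/hive/part_test/',
    '/warehouse/tablespace/external/hive/part_test11/')
WHERE LOCATION LIKE '%/warehouse/tablespace/external/hive/part_test/dat=%';
{code}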


was (Author: sanguines):
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations of the table in 
the SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
) PARTITIONED BY (dat string);

insert into part_test values ("11","th","20220101");
insert into part_test values ("22","th","20220102");

alter table part_test rename to part_test11;

-- this query returns no rows.
select * from part_test11 where dat="20220101";
||part_test.c1||part_test.c2||part_test.dat||
| | | |

-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|*LOCATION*|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to update the partition locations of the table in SDS so that queries 
return the correct results

> TRANSLATED_TO_EXTERNAL tables may write to the same location
> 
>
> Key: HIVE-24920
> URL: https://issues.apache.org/jira/browse/HIVE-24920
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: metastore_translator, pull-request-available
> Fix For: 4.0.0, 4.0.0-alpha-1
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> {code}
> create table t (a integer);
> insert into t values(1);
> alter table t rename to t2;
> create table t (a integer); -- I expected an exception from this command 
> (location already exists), but because it's an external table no exception is thrown
> insert into t values(2);
> select * from t;  -- shows 1 and 2
> drop table t2;-- wipes out data location
> select * from t;  -- empty resultset
> {code}
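
A hedged way to see the overlap described in the snippet (the annotations are 
assumptions, not captured from a real run): after the rename, both tables 
report the same location, because renaming the translated external table did 
not move its directory.

{code:sql}
-- Hypothetical check: compare the Location lines of the two tables.
DESCRIBE FORMATTED t;
DESCRIBE FORMATTED t2;  -- same Location, since the rename left the directory in place
{code}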



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-24969) Predicates may be removed when decorrelating subqueries with lateral

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24969?focusedWorklogId=759665&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759665
 ]

ASF GitHub Bot logged work on HIVE-24969:
-

Author: ASF GitHub Bot
Created on: 21/Apr/22 00:21
Start Date: 21/Apr/22 00:21
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #3018: 
HIVE-24969: Predicates may be removed when decorrelating subqueries with lateral
URL: https://github.com/apache/hive/pull/3018




Issue Time Tracking
---

Worklog Id: (was: 759665)
Time Spent: 3h  (was: 2h 50m)

> Predicates may be removed when decorrelating subqueries with lateral
> 
>
> Key: HIVE-24969
> URL: https://issues.apache.org/jira/browse/HIVE-24969
> Project: Hive
>  Issue Type: Bug
>  Components: Logical Optimizer
>Reporter: Zhihua Deng
>Assignee: Zhihua Deng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Steps to reproduce:
> {code:java}
> select count(distinct logItem.triggerId)
> from service_stat_log LATERAL VIEW explode(logItems) LogItemTable AS logItem
> where logItem.dsp in ('delivery', 'ocpa')
> and logItem.iswin = true
> and logItem.adid in (
>  select distinct adId
>  from ad_info
>  where subAccountId in (16010, 14863));  {code}
> The predicates _logItem.dsp in ('delivery', 'ocpa')_ and _logItem.iswin = 
> true_ are removed when doing ppd: JOIN -> RS -> LVJ. The JOIN has 
> candidates: logitem -> [logItem.dsp in ('delivery', 'ocpa'), logItem.iswin = 
> true]; when pushing them to the RS followed by the LVJ, none of them are 
> pushed, and the candidates of logitem are finally removed by default, which 
> causes the wrong result.
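
Until the fix lands, one possible workaround sketch (a rewrite for 
illustration, not taken from the ticket): evaluate the lateral-view predicates 
inside a derived table so the decorrelation no longer drops them.

{code:sql}
-- Assumed workaround, using the same tables as the repro above: filter in a
-- subquery before the IN subquery against ad_info is decorrelated.
select count(distinct t.triggerId)
from (
  select logItem.triggerId, logItem.adid
  from service_stat_log LATERAL VIEW explode(logItems) LogItemTable AS logItem
  where logItem.dsp in ('delivery', 'ocpa')
    and logItem.iswin = true
) t
where t.adid in (
  select distinct adId from ad_info where subAccountId in (16010, 14863));
{code}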



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26072) Enable vectorization for stats gathering (tablescan op)

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26072?focusedWorklogId=759621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759621
 ]

ASF GitHub Bot logged work on HIVE-26072:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 22:12
Start Date: 20/Apr/22 22:12
Worklog Time Spent: 10m 
  Work Description: ramesh0201 opened a new pull request, #3228:
URL: https://github.com/apache/hive/pull/3228

   …Rough patch (Do Not merge)
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 759621)
Time Spent: 0.5h  (was: 20m)

> Enable vectorization for stats gathering (tablescan op)
> ---
>
> Key: HIVE-26072
> URL: https://issues.apache.org/jira/browse/HIVE-26072
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Rajesh Balamohan
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/HIVE-24510 enabled vectorization for 
> compute_bit_vector, but the tablescan operator for stats gathering is still 
> disabled by default:
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/Vectorizer.java#L2577]
> We need to enable vectorization for this; it can significantly reduce 
> runtimes of analyze statements on large tables.
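
For context, these are the statements whose runtime the change targets (the 
table name is a placeholder):

{code:sql}
-- Column stats gathering runs the tablescan that this ticket wants vectorized.
ANALYZE TABLE big_table COMPUTE STATISTICS FOR COLUMNS;
{code}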



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-21456) Hive Metastore Thrift over HTTP

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21456?focusedWorklogId=759515&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759515
 ]

ASF GitHub Bot logged work on HIVE-21456:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 19:33
Start Date: 20/Apr/22 19:33
Worklog Time Spent: 10m 
  Work Description: sourabh912 commented on PR #3105:
URL: https://github.com/apache/hive/pull/3105#issuecomment-1104380804

   Thanks for the review @pvary @nrg4878 @yongzhi and @saihemanth-cloudera 




Issue Time Tracking
---

Worklog Id: (was: 759515)
Time Spent: 6h 50m  (was: 6h 40m)

> Hive Metastore Thrift over HTTP
> ---
>
> Key: HIVE-21456
> URL: https://issues.apache.org/jira/browse/HIVE-21456
> Project: Hive
>  Issue Type: New Feature
>  Components: Metastore, Standalone Metastore
>Reporter: Amit Khanna
>Assignee: Sourabh Goyal
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-21456.2.patch, HIVE-21456.3.patch, 
> HIVE-21456.4.patch, HIVE-21456.patch
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Hive Metastore currently doesn't have support for HTTP transport, because of 
> which it is not possible to access it via Knox. Adding support for Thrift 
> over HTTP transport will allow clients to access it via Knox.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar

2022-04-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HIVE-26074.
-
Fix Version/s: 4.0.0-alpha-2
 Assignee: Ayush Saxena  (was: László Bodor)
   Resolution: Fixed

> PTF Vectorization: BoundaryScanner for varchar
> --
>
> Key: HIVE-26074
> URL: https://issues.apache.org/jira/browse/HIVE-26074
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on varchar type
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: 
> attempt to setup a Window for typeString: 'varchar(170)'
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.<init>(ValueBoundaryScanner.java:1257)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
>   ... 16 more
> {code}
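
A hedged repro sketch of the failure mode (table and column names are invented; 
the ticket itself does not include a query): a PTF whose RANGE window orders by 
a varchar(170) column reaches the boundary-scanner lookup shown in the stack 
trace above.

{code:sql}
-- Hypothetical query shape that would exercise the varchar boundary scanner.
SELECT p, v,
       count(*) OVER (PARTITION BY p ORDER BY v
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS c
FROM t;  -- assume v is varchar(170)
{code}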



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar

2022-04-20 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525179#comment-17525179
 ] 

Ayush Saxena commented on HIVE-26074:
-

Merged PR to master.

Thanx [~abstractdog] for the review!!!

> PTF Vectorization: BoundaryScanner for varchar
> --
>
> Key: HIVE-26074
> URL: https://issues.apache.org/jira/browse/HIVE-26074
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on varchar type
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: 
> attempt to setup a Window for typeString: 'varchar(170)'
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.<init>(ValueBoundaryScanner.java:1257)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
>   ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759469&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759469
 ]

ASF GitHub Bot logged work on HIVE-26074:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 18:25
Start Date: 20/Apr/22 18:25
Worklog Time Spent: 10m 
  Work Description: ayushtkn merged PR #3187:
URL: https://github.com/apache/hive/pull/3187




Issue Time Tracking
---

Worklog Id: (was: 759469)
Time Spent: 1.5h  (was: 1h 20m)

> PTF Vectorization: BoundaryScanner for varchar
> --
>
> Key: HIVE-26074
> URL: https://issues.apache.org/jira/browse/HIVE-26074
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on varchar type
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: 
> attempt to setup a Window for typeString: 'varchar(170)'
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.<init>(ValueBoundaryScanner.java:1257)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
>   ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26159) hive cli is unavailable from hive command

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-26159:
--
Labels: pull-request-available  (was: )

> hive cli is unavailable from hive command
> -
>
> Key: HIVE-26159
> URL: https://issues.apache.org/jira/browse/HIVE-26159
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0-alpha-1
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive cli is a convenient tool to connect to the hive metastore service, but 
> now hive cli cannot start even if we use the *--service cli* option; this 
> appears to be a bug introduced by ticket HIVE-24348.
> *Steps to reproduce:*
> {code:bash}
> hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf 
> hive.metastore.uris=thrift://hive:9084
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
> beeline> 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26159) hive cli is unavailable from hive command

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26159?focusedWorklogId=759361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759361
 ]

ASF GitHub Bot logged work on HIVE-26159:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 16:18
Start Date: 20/Apr/22 16:18
Worklog Time Spent: 10m 
  Work Description: wecharyu opened a new pull request, #3227:
URL: https://github.com/apache/hive/pull/3227

   ### What changes were proposed in this pull request?
   
   Hive cli is the default service in the hive script, but it cannot start even 
if we use the `--service cli` option now; I don't think this is what we expect 
from beeline.
   
   ### Why are the changes needed?
   
   Hive cli is a convenient tool to connect to the hive metastore service for 
testing or trying hive out, and the hive community does not seem to intend to 
deprecate hive cli so far.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   ### How was this patch tested?
   
   Shell script; it can be tested locally with just the `hive` command:
   ```bash
   $ $HIVE_HOME/bin/hive
   ```
   




Issue Time Tracking
---

Worklog Id: (was: 759361)
Remaining Estimate: 0h
Time Spent: 10m

> hive cli is unavailable from hive command
> -
>
> Key: HIVE-26159
> URL: https://issues.apache.org/jira/browse/HIVE-26159
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0-alpha-1
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive cli is a convenient tool to connect to the hive metastore service, but 
> now hive cli cannot start even if we use the *--service cli* option; this 
> appears to be a bug introduced by ticket HIVE-24348.
> *Steps to reproduce:*
> {code:bash}
> hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf 
> hive.metastore.uris=thrift://hive:9084
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
> beeline> 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26159) hive cli is unavailable from hive command

2022-04-20 Thread Wechar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wechar updated HIVE-26159:
--
Description: 
Hive cli is a convenient tool to connect to the hive metastore service, but now 
hive cli cannot start even if we use the *--service cli* option; this appears 
to be a bug introduced by ticket HIVE-24348.

*Steps to reproduce:*
{code:bash}
hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf 
hive.metastore.uris=thrift://hive:9084
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
beeline> 
{code}

  was:
Hive cli is a convenient tool to connect to the hive metastore service, but now 
hive cli cannot start even if we use the *--service cli* option; this appears 
to be a bug introduced by ticket [HIVE-24348|https://issues.apache.org/jira/browse/HIVE-24348].

*Step to reproduce:*
{code:bash}
hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf 
hive.metastore.uris=thrift://hive:9084
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
beeline> 
{code}




> hive cli is unavailable from hive command
> -
>
> Key: HIVE-26159
> URL: https://issues.apache.org/jira/browse/HIVE-26159
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0-alpha-1
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
> Fix For: 4.0.0
>
>
> Hive cli is a convenient tool to connect to the hive metastore service, but 
> now hive cli cannot start even if we use the *--service cli* option; this 
> appears to be a bug introduced by ticket HIVE-24348.
> *Steps to reproduce:*
> {code:bash}
> hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf 
> hive.metastore.uris=thrift://hive:9084
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
> beeline> 
> {code}



--

[jira] [Assigned] (HIVE-26159) hive cli is unavailable from hive command

2022-04-20 Thread Wechar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wechar reassigned HIVE-26159:
-


> hive cli is unavailable from hive command
> -
>
> Key: HIVE-26159
> URL: https://issues.apache.org/jira/browse/HIVE-26159
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0-alpha-1
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
> Fix For: 4.0.0
>
>
> Hive cli is a convenient tool to connect to the hive metastore service, but 
> now hive cli cannot start even if we use the *--service cli* option; this 
> appears to be a bug introduced by ticket [HIVE-24348|https://issues.apache.org/jira/browse/HIVE-24348].
> *Step to reproduce:*
> {code:bash}
> hive@hive:/root$ /usr/share/hive/bin/hive --service cli --hiveconf 
> hive.metastore.uris=thrift://hive:9084
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/apache-hive-4.0.0-alpha-2-SNAPSHOT-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-3.3.1/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Beeline version 4.0.0-alpha-2-SNAPSHOT by Apache Hive
> beeline> 
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution

2022-04-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/HIVE-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525084#comment-17525084
 ] 

Ádám Szita commented on HIVE-26137:
---

Note: this patch also reverts the hacky solution of HIVE-25967

> Optimized transfer of Iceberg residual expressions from AM to execution
> ---
>
> Key: HIVE-26137
> URL: https://issues.apache.org/jira/browse/HIVE-26137
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> HIVE-25967 introduced a hack to prevent Iceberg filter expressions from being 
> serialized into splits. This temporary fix avoided OOM problems on the Tez AM 
> side, but at the same time prevented predicate pushdown from working on the 
> execution side too.
> This ticket intends to provide the long-term solution. It turns out that the 
> file scan tasks created by Iceberg don't actually contain a "residual" 
> expression, but rather the complete/original one. It becomes residual only 
> when it is evaluated against the task's partition value, which only happens 
> on the execution side. This means that the original filter is the same 
> expression for all splits in the Tez AM, so we can transfer it via the job 
> conf instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution

2022-04-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ádám Szita resolved HIVE-26137.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Committed to master. Thanks for the review [~mbod] 

> Optimized transfer of Iceberg residual expressions from AM to execution
> ---
>
> Key: HIVE-26137
> URL: https://issues.apache.org/jira/browse/HIVE-26137
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> HIVE-25967 introduced a hack to prevent Iceberg filter expressions from being 
> serialized into splits. This temporary fix avoided OOM problems on the Tez AM 
> side, but at the same time prevented predicate pushdown from working on the 
> execution side too.
> This ticket intends to provide the long-term solution. It turns out that the 
> file scan tasks created by Iceberg don't actually contain a "residual" 
> expression, but rather the complete/original one. It becomes residual only 
> when it is evaluated against the task's partition value, which only happens 
> on the execution side. This means that the original filter is the same 
> expression for all splits in the Tez AM, so we can transfer it via the job 
> conf instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26137?focusedWorklogId=759329&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759329
 ]

ASF GitHub Bot logged work on HIVE-26137:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 15:46
Start Date: 20/Apr/22 15:46
Worklog Time Spent: 10m 
  Work Description: szlta merged PR #3203:
URL: https://github.com/apache/hive/pull/3203




Issue Time Tracking
---

Worklog Id: (was: 759329)
Time Spent: 40m  (was: 0.5h)

> Optimized transfer of Iceberg residual expressions from AM to execution
> ---
>
> Key: HIVE-26137
> URL: https://issues.apache.org/jira/browse/HIVE-26137
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> HIVE-25967 introduced a hack to prevent Iceberg filter expressions from being 
> serialized into splits. This temporary fix avoided OOM problems on the Tez AM 
> side, but at the same time prevented predicate pushdown from working on the 
> execution side too.
> This ticket intends to provide the long-term solution. It turns out that the 
> file scan tasks created by Iceberg don't actually contain a "residual" 
> expression, but rather the complete/original one. It becomes residual only 
> when it is evaluated against the task's partition value, which only happens 
> on the execution side. This means that the original filter is the same 
> expression for all splits in the Tez AM, so we can transfer it via the job 
> conf instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HIVE-26137) Optimized transfer of Iceberg residual expressions from AM to execution

2022-04-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ádám Szita reassigned HIVE-26137:
-

Assignee: Ádám Szita

> Optimized transfer of Iceberg residual expressions from AM to execution
> ---
>
> Key: HIVE-26137
> URL: https://issues.apache.org/jira/browse/HIVE-26137
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ádám Szita
>Assignee: Ádám Szita
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> HIVE-25967 introduced a hack to prevent Iceberg filter expressions from being 
> serialized into splits. This temporary fix avoided OOM problems on the Tez AM 
> side, but at the same time prevented predicate pushdown from working on the 
> execution side too.
> This ticket intends to provide the long-term solution. It turns out that the 
> file scan tasks created by Iceberg don't actually contain a "residual" 
> expression, but rather the complete/original one. It becomes residual only 
> when it is evaluated against the task's partition value, which only happens 
> on the execution side. This means that the original filter is the same 
> expression for all splits in the Tez AM, so we can transfer it via the job 
> conf instead.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759316&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759316
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 15:26
Start Date: 20/Apr/22 15:26
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on PR #3222:
URL: https://github.com/apache/hive/pull/3222#issuecomment-1104065890

   The syntax should be marked experimental/alpha with evolving semantics here. 
The problem is that FROM ... TO has different semantics in standard SQL: it 
returns all rows that were active during the time window (so it can contain 
rows that were inserted before the time window), instead of just the appends 
that happened during the window. Let's discuss it tomorrow, but we might want 
to come up with a way to document that this is alpha/experimental at this stage.




Issue Time Tracking
---

Worklog Id: (was: 759316)
Time Spent: 1h 40m  (was: 1.5h)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<timestamp1>' TO '<timestamp2>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <version1> TO <version2>
> (the TO clause is optional in both cases)
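
Concrete usage under the proposed syntax might look like this (the table name 
and literals are invented for illustration):

{code:sql}
-- Appends between two timestamps:
SELECT * FROM t FOR SYSTEM_TIME FROM '2022-04-01 00:00:00' TO '2022-04-20 00:00:00';
-- Appends since a snapshot version; the TO clause is optional:
SELECT * FROM t FOR SYSTEM_VERSION FROM 8764839531125283551;
{code}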



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759306&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759306
 ]

ASF GitHub Bot logged work on HIVE-26157:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 15:18
Start Date: 20/Apr/22 15:18
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3226:
URL: https://github.com/apache/hive/pull/3226#discussion_r854264528


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##
@@ -104,7 +106,7 @@
 public class HiveIcebergStorageHandler implements HiveStoragePredicateHandler, HiveStorageHandler {
   private static final Logger LOG = LoggerFactory.getLogger(HiveIcebergStorageHandler.class);
 
-  private static final String ICEBERG_URI_PREFIX = "iceberg://";
+  private static final String ICEBERG_URI_PREFIX = "iceberg:/";

Review Comment:
   Was this change necessary? I think it's supposed to work like a 'protocol' 
that's why there is the double-slash. I think kafka and druid did the same 
thing (i.e. `druid://` and `kafka://`)





Issue Time Tracking
---

Worklog Id: (was: 759306)
Time Spent: 1h  (was: 50m)

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data by querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}
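
For context, the spoofed pointer can be inspected directly (a hedged sketch, 
reusing the table name from the snippet above):

{code:sql}
-- Show where table A's metadata pointer currently points.
SHOW TBLPROPERTIES A ('metadata_location');
{code}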



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759265&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759265
 ]

ASF GitHub Bot logged work on HIVE-26157:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 14:29
Start Date: 20/Apr/22 14:29
Worklog Time Spent: 10m 
  Work Description: szlta commented on code in PR #3226:
URL: https://github.com/apache/hive/pull/3226#discussion_r854208599


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##
@@ -43,6 +45,26 @@ private IcebergTableUtil() {
 
   }
 
+  /**
+   * Constructs the table properties needed for the Iceberg table loading by retrieving the information from the
+   * hmsTable. It then calls {@link IcebergTableUtil#getTable(Configuration, Properties)} with these properties.
+   * @param configuration a Hadoop configuration
+   * @param hmsTable the HMS table
+   * @return the Iceberg table
+   */
+  static Table getTable(Configuration configuration, org.apache.hadoop.hive.metastore.api.Table hmsTable) {
+    Properties properties = new Properties();
+    properties.setProperty(Catalogs.NAME, TableIdentifier.of(hmsTable.getDbName(), hmsTable.getTableName()).toString());
+    if (hmsTable.getSd() != null) {
+      properties.setProperty(Catalogs.LOCATION, hmsTable.getSd().getLocation());
+    }
+    if (hmsTable.getParameters().containsKey(InputFormatConfig.CATALOG_NAME)) {
+      properties.setProperty(
+          InputFormatConfig.CATALOG_NAME, hmsTable.getParameters().get(InputFormatConfig.CATALOG_NAME));
+    }

Review Comment:
   nit: could do with one look-up only by calling get() beforehand, and 
refactoring the if condition into null check.





Issue Time Tracking
---

Worklog Id: (was: 759265)
Time Spent: 50m  (was: 40m)

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data by querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759255&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759255
 ]

ASF GitHub Bot logged work on HIVE-26157:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 14:18
Start Date: 20/Apr/22 14:18
Worklog Time Spent: 10m 
  Work Description: szlta commented on code in PR #3226:
URL: https://github.com/apache/hive/pull/3226#discussion_r854194435


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##
@@ -450,7 +452,9 @@ public boolean isValidMetadataTable(String metaTableName) {
   public URI getURIForAuth(org.apache.hadoop.hive.metastore.api.Table hmsTable) throws URISyntaxException {
     String dbName = hmsTable.getDbName();
     String tableName = hmsTable.getTableName();
-    return new URI(ICEBERG_URI_PREFIX + dbName + "/" + tableName);
+    Table table = IcebergTableUtil.getTable(conf, hmsTable);

Review Comment:
   Giving back `hmsTable.getSd().location() + "/metadata"` seems reasonable to 
me in such cases





Issue Time Tracking
---

Worklog Id: (was: 759255)
Time Spent: 40m  (was: 0.5h)

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data by querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759247&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759247
 ]

ASF GitHub Bot logged work on HIVE-26157:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:50
Start Date: 20/Apr/22 13:50
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3226:
URL: https://github.com/apache/hive/pull/3226#discussion_r854160373


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##
@@ -450,7 +452,9 @@ public boolean isValidMetadataTable(String metaTableName) {
   public URI getURIForAuth(org.apache.hadoop.hive.metastore.api.Table hmsTable) throws URISyntaxException {
     String dbName = hmsTable.getDbName();
     String tableName = hmsTable.getTableName();
-    return new URI(ICEBERG_URI_PREFIX + dbName + "/" + tableName);
+    Table table = IcebergTableUtil.getTable(conf, hmsTable);

Review Comment:
   We ran into problems with the approach of loading the Iceberg table here 
before. The problem is that this method can be called to authorize CREATE TABLE 
commands as well, at which point the iceberg table does not exist yet, so this 
will lead to NPE.
   
   In that case, when the table object is null, then maybe we fall back to 
using the `hmsTable.getSd().location() + "/metadata"`? I'm not sure though, 
just thinking out loud 





Issue Time Tracking
---

Worklog Id: (was: 759247)
Time Spent: 0.5h  (was: 20m)

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data by querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759246&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759246
 ]

ASF GitHub Bot logged work on HIVE-26157:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:50
Start Date: 20/Apr/22 13:50
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3226:
URL: https://github.com/apache/hive/pull/3226#discussion_r854160373


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##
@@ -450,7 +452,9 @@ public boolean isValidMetadataTable(String metaTableName) {
   public URI getURIForAuth(org.apache.hadoop.hive.metastore.api.Table hmsTable) throws URISyntaxException {
     String dbName = hmsTable.getDbName();
     String tableName = hmsTable.getTableName();
-    return new URI(ICEBERG_URI_PREFIX + dbName + "/" + tableName);
+    Table table = IcebergTableUtil.getTable(conf, hmsTable);

Review Comment:
   We ran into problems with the approach of loading the Iceberg table here 
before. The problem is that this method can be called to authorize CREATE TABLE 
commands as well, at which point the iceberg table does not exist yet, so this 
will lead to NPE.
   
   If the table object is null, then maybe we can use the 
hmsTable.getSd().location() + "/metadata"? I'm not sure though, just thinking 
out loud 





Issue Time Tracking
---

Worklog Id: (was: 759246)
Time Spent: 20m  (was: 10m)

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data by querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759239&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759239
 ]

ASF GitHub Bot logged work on HIVE-26156:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:39
Start Date: 20/Apr/22 13:39
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3225:
URL: https://github.com/apache/hive/pull/3225#discussion_r854147948


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##
@@ -374,9 +373,12 @@ public DynamicPartitionCtx createDPContext(HiveConf hiveConf, org.apache.hadoop.
   fieldOrderMap.put(fields.get(i).name(), i);
 }
 
+    // deletes already use the bucket values in the partition_struct for sorting, so no need to add the sort expression

Review Comment:
   Yes, good catch, I think we can avoid sorting by the other partition columns 
too.





Issue Time Tracking
---

Worklog Id: (was: 759239)
Time Spent: 40m  (was: 0.5h)

> Iceberg delete writer should handle deleting from old partition specs
> -
>
> Key: HIVE-26156
> URL: https://issues.apache.org/jira/browse/HIVE-26156
> Project: Hive
>  Issue Type: Bug
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> While {{HiveIcebergRecordWriter}} always writes data out according to the 
> latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files 
> into partitions that correspond to a variety of specs, both old and new. 
> Therefore we should pass the {{table.specs()}} map into the 
> {{HiveIcebergWriter}} so that the delete writer can choose the appropriate 
> spec on a per-record basis.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location

2022-04-20 Thread tanghui (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524983#comment-17524983
 ] 

tanghui edited comment on HIVE-24920 at 4/20/22 1:38 PM:
-

After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations of the table in 
the SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
) PARTITIONED BY (dat string);

insert into part_test values ("11","th","20220101");
insert into part_test values ("22","th","20220102");

alter table part_test rename to part_test11;

-- this query returns no rows.
select * from part_test11 where dat="20220101";
||part_test.c1||part_test.c2||part_test.dat||
| | | |

-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|*LOCATION*|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to update the partition locations of the table in SDS so that queries 
return the correct results


was (Author: sanguines):
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations of the table in 
the SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
) PARTITIONED BY (dat string);

insert into part_test values ("11","th","20220101");
insert into part_test values ("22","th","20220102");

alter table part_test rename to part_test11;

-- this returns no data from the query partition.
select * from part_test11 where dat="20220101";
-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to update the partition locations of the table in SDS so that queries 
return the correct results

 

> TRANSLATED_TO_EXTERNAL tables may write to the same location
> 
>
> Key: HIVE-24920
> URL: https://issues.apache.org/jira/browse/HIVE-24920
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: metastore_translator, pull-request-available
> Fix For: 4.0.0, 4.0.0-alpha-1
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> {code}
> create table t (a integer);
> insert into t values(1);
> alter table t rename to t2;
> create table t (a integer); -- I expected an exception from this command 
> (location already exists) but because its an external table no exception
> insert into t values(2);
> select * from t;  -- shows 1 and 2
> drop table t2;-- wipes out data location
> select * from t;  -- empty resultset
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759237&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759237
 ]

ASF GitHub Bot logged work on HIVE-26156:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:37
Start Date: 20/Apr/22 13:37
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3225:
URL: https://github.com/apache/hive/pull/3225#discussion_r854146249


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergRecordWriter.java:
##
@@ -37,17 +38,17 @@
 
 class HiveIcebergRecordWriter extends HiveIcebergWriter {
 
-  HiveIcebergRecordWriter(Schema schema, PartitionSpec spec, FileFormat format,
+  HiveIcebergRecordWriter(Schema schema, Map<Integer, PartitionSpec> specs, FileFormat format,
       FileWriterFactory<Record> fileWriterFactory, OutputFileFactory fileFactory, FileIO io, long targetFileSize,
       TaskAttemptID taskAttemptID, String tableName) {
-    super(schema, spec, io, taskAttemptID, tableName,
+    super(schema, specs, io, taskAttemptID, tableName,
         new ClusteredDataWriter<>(fileWriterFactory, fileFactory, io, format, targetFileSize));
   }
 
   @Override
   public void write(Writable row) throws IOException {
     Record record = ((Container<Record>) row).get();
-    writer.write(record, spec, partition(record));
+    writer.write(record, specs.get(specs.size() - 1), partition(record));

Review Comment:
   Good catch, I did not know that Iceberg reused the old spec in step 3. Will 
store the latest spec in the record writer then.





Issue Time Tracking
---

Worklog Id: (was: 759237)
Time Spent: 0.5h  (was: 20m)

> Iceberg delete writer should handle deleting from old partition specs
> -
>
> Key: HIVE-26156
> URL: https://issues.apache.org/jira/browse/HIVE-26156
> Project: Hive
>  Issue Type: Bug
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> While {{HiveIcebergRecordWriter}} always writes data out according to the 
> latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files 
> into partitions that correspond to a variety of specs, both old and new. 
> Therefore we should pass the {{table.specs()}} map into the 
> {{HiveIcebergWriter}} so that the delete writer can choose the appropriate 
> spec on a per-record basis.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table

2022-04-20 Thread tanghui (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanghui updated HIVE-26158:
---
Description: 
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations of the table in 
the SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
) PARTITIONED BY (dat string);

insert into part_test values ("11","th","20220101");
insert into part_test values ("22","th","20220102");

alter table part_test rename to part_test11;

-- this query returns no rows.
select * from part_test11 where dat="20220101";
||part_test.c1||part_test.c2||part_test.dat||
| | | |

-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|*LOCATION*|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.
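One possible stopgap until the fix lands is to rewrite the stale locations 
directly in the metastore RDBMS. This is only a hedged sketch: it assumes a 
MySQL-backed metastore and reuses the SDS table/column names from the query 
above; take a metastore backup before attempting anything like it.

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class FixPartitionLocations {
  public static void main(String[] args) throws Exception {
    // Placeholder JDBC URL and credentials for the metastore database.
    String url = "jdbc:mysql://metastore-host:3306/hive";
    try (Connection conn = DriverManager.getConnection(url, "hive", "secret");
         PreparedStatement ps = conn.prepareStatement(
             "UPDATE SDS SET LOCATION = REPLACE(LOCATION, ?, ?) WHERE LOCATION LIKE ?")) {
      // Rewrite the .../part_test/dat=... partition locations to the renamed table.
      ps.setString(1, "/external/hive/part_test/");
      ps.setString(2, "/external/hive/part_test11/");
      ps.setString(3, "%/external/hive/part_test/%");
      System.out.println("rows updated: " + ps.executeUpdate());
    }
  }
}
{code}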

  was:
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

--this result is null.
select * from part_test11 where dat="20220101";
||part_test.c1||part_test.c2||part_test.dat||
| | | |


-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.


> TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after 
> rename table
> --
>
> Key: HIVE-26158
> URL: https://issues.apache.org/jira/browse/HIVE-26158
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2
>Reporter: tanghui
>Priority: Major
>
> After the patch is applied, the partitioned table's location and its HDFS data 
> directory are displayed correctly, but the partition locations recorded in the 
> SDS table of the Hive metastore database still point to the old table's 
> location, so queries against those partitions return no data.
>  
> in beeline:
> 
> set hive.create.as.external.legacy=true;
> CREATE TABLE part_test(
> c1 string
> ,c2 string
> )PARTITIONED BY (dat string)
> insert into part_test values ("11","th","20220101")
> insert into part_test values ("22","th","20220102")
> alter table part_test rename to part_test11;
> --this result is null.
> select * from part_test11 where dat="20220101";
> ||part_test.c1||part_test.c2||part_test.dat||
> | | | |
> -
> SDS in the Hive metastore database:
> select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
> TBLS.TBL_ID=SDS.CD_ID;
> ---
> |*LOCATION*|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
> 

[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759236=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759236
 ]

ASF GitHub Bot logged work on HIVE-26156:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:37
Start Date: 20/Apr/22 13:37
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3225:
URL: https://github.com/apache/hive/pull/3225#discussion_r854145547


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergDeleteWriter.java:
##
@@ -53,7 +54,8 @@ public class HiveIcebergDeleteWriter extends 
HiveIcebergWriter {
   public void write(Writable row) throws IOException {
 Record rec = ((Container<Record>) row).get();
 PositionDelete<Record> positionDelete = IcebergAcidUtil.getPositionDelete(rec, rowDataTemplate);
-writer.write(positionDelete, spec, partition(positionDelete.row()));
+Integer specId = rec.get(0, Integer.class);

Review Comment:
   actually, there's an existing util method to parse out the specid from a 
record, so I'll use that
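   Sketched out, the per-record spec lookup would look roughly like this 
fragment of the write() method quoted above (`parseSpecId` is a stand-in name 
for that existing util, not its real signature):

```java
// Fragment of write(Writable row), illustrative only:
Record rec = ((Container<Record>) row).get();
PositionDelete<Record> positionDelete = IcebergAcidUtil.getPositionDelete(rec, rowDataTemplate);
int specId = IcebergAcidUtil.parseSpecId(rec);  // hypothetical helper reading the spec-id meta column
// pick the spec the data file was written with, not the table's latest spec
writer.write(positionDelete, specs.get(specId), partition(positionDelete.row(), specId));
```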





Issue Time Tracking
---

Worklog Id: (was: 759236)
Time Spent: 20m  (was: 10m)

> Iceberg delete writer should handle deleting from old partition specs
> -
>
> Key: HIVE-26156
> URL: https://issues.apache.org/jira/browse/HIVE-26156
> Project: Hive
>  Issue Type: Bug
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While {{HiveIcebergRecordWriter}} always writes data out according to the 
> latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files 
> into partitions that correspond to a variety of specs, both old and new. 
> Therefore we should pass the {{table.specs()}} map into the 
> {{HiveIcebergWriter}} so that the delete writer can choose the appropriate 
> spec on a per-record basis.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table

2022-04-20 Thread tanghui (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanghui updated HIVE-26158:
---
Description: 
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

--this result is null.
select * from part_test11 where dat="20220101";
||part_test.c1||part_test.c2||part_test.dat||
| | | |


-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.

  was:
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

-- this returns no data for the queried partition.
select * from part_test11 where dat="20220101";
-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.


> TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after 
> rename table
> --
>
> Key: HIVE-26158
> URL: https://issues.apache.org/jira/browse/HIVE-26158
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2
>Reporter: tanghui
>Priority: Major
>
> After the patch is applied, the partitioned table's location and its HDFS data 
> directory are displayed correctly, but the partition locations recorded in the 
> SDS table of the Hive metastore database still point to the old table's 
> location, so queries against those partitions return no data.
>  
> in beeline:
> 
> set hive.create.as.external.legacy=true;
> CREATE TABLE part_test(
> c1 string
> ,c2 string
> )PARTITIONED BY (dat string)
> insert into part_test values ("11","th","20220101")
> insert into part_test values ("22","th","20220102")
> alter table part_test rename to part_test11;
> --this result is null.
> select * from part_test11 where dat="20220101";
> ||part_test.c1||part_test.c2||part_test.dat||
> | | | |
> -
> SDS in the Hive metastore database:
> select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
> TBLS.TBL_ID=SDS.CD_ID;
> ---
> |LOCATION|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
> 

[jira] [Work logged] (HIVE-21456) Hive Metastore Thrift over HTTP

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-21456?focusedWorklogId=759233=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759233
 ]

ASF GitHub Bot logged work on HIVE-21456:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:33
Start Date: 20/Apr/22 13:33
Worklog Time Spent: 10m 
  Work Description: yongzhi merged PR #3105:
URL: https://github.com/apache/hive/pull/3105




Issue Time Tracking
---

Worklog Id: (was: 759233)
Time Spent: 6h 40m  (was: 6.5h)

> Hive Metastore Thrift over HTTP
> ---
>
> Key: HIVE-21456
> URL: https://issues.apache.org/jira/browse/HIVE-21456
> Project: Hive
>  Issue Type: New Feature
>  Components: Metastore, Standalone Metastore
>Reporter: Amit Khanna
>Assignee: Sourabh Goyal
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-21456.2.patch, HIVE-21456.3.patch, 
> HIVE-21456.4.patch, HIVE-21456.patch
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Hive Metastore currently doesn't support HTTP transport, which makes it 
> impossible to access via Knox. Adding support for Thrift over HTTP transport 
> will allow clients to access it via Knox.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table

2022-04-20 Thread tanghui (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanghui updated HIVE-26158:
---
Description: 
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.

 

in beeline:



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

-- this returns no data for the queried partition.
select * from part_test11 where dat="20220101";
-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.

  was:
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

-- this returns no data for the queried partition.
select * from part_test11 where dat="20220101";
-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.


> TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after 
> rename table
> --
>
> Key: HIVE-26158
> URL: https://issues.apache.org/jira/browse/HIVE-26158
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2
>Reporter: tanghui
>Priority: Major
>
> After the patch is applied, the partitioned table's location and its HDFS data 
> directory are displayed correctly, but the partition locations recorded in the 
> SDS table of the Hive metastore database still point to the old table's 
> location, so queries against those partitions return no data.
>  
> in beeline:
> 
> set hive.create.as.external.legacy=true;
> CREATE TABLE part_test(
> c1 string
> ,c2 string
> )PARTITIONED BY (dat string)
> insert into part_test values ("11","th","20220101")
> insert into part_test values ("22","th","20220102")
> alter table part_test rename to part_test11;
> -- this returns no data for the queried partition.
> select * from part_test11 where dat="20220101";
> -
> SDS in the Hive metastore database:
> select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
> TBLS.TBL_ID=SDS.CD_ID;
> ---
> |LOCATION|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|
> 

[jira] [Updated] (HIVE-26158) TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after rename table

2022-04-20 Thread tanghui (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanghui updated HIVE-26158:
---
Summary: TRANSLATED_TO_EXTERNAL partition tables cannot query partition 
data after rename table  (was: TRANSLATED_TO_EXTERNAL partition tables cannot 
query partition data after rename)

> TRANSLATED_TO_EXTERNAL partition tables cannot query partition data after 
> rename table
> --
>
> Key: HIVE-26158
> URL: https://issues.apache.org/jira/browse/HIVE-26158
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0, 4.0.0-alpha-1, 4.0.0-alpha-2
>Reporter: tanghui
>Priority: Major
>
> After the patch is applied, the partitioned table's location and its HDFS data 
> directory are displayed correctly, but the partition locations recorded in the 
> SDS table of the Hive metastore database still point to the old table's 
> location, so queries against those partitions return no data.
> 
> set hive.create.as.external.legacy=true;
> CREATE TABLE part_test(
> c1 string
> ,c2 string
> )PARTITIONED BY (dat string)
> insert into part_test values ("11","th","20220101")
> insert into part_test values ("22","th","20220102")
> alter table part_test rename to part_test11;
> -- this returns no data for the queried partition.
> select * from part_test11 where dat="20220101";
> -
> SDS in the Hive metastore database:
> select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
> TBLS.TBL_ID=SDS.CD_ID;
> ---
> |LOCATION|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
> |hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|
> ---
>  
> We need to modify the partition locations of the table in SDS so that 
> partition queries return the correct results.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-26157:
--
Labels: pull-request-available  (was: )

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data via querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?focusedWorklogId=759231=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759231
 ]

ASF GitHub Bot logged work on HIVE-26157:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:30
Start Date: 20/Apr/22 13:30
Worklog Time Spent: 10m 
  Work Description: lcspinter opened a new pull request, #3226:
URL: https://github.com/apache/hive/pull/3226

   
   
   ### What changes were proposed in this pull request?
   Change Iceberg storage handler authz URI from `iceberg://dbName/tableName` 
format to `iceberg://metadataLocation`
   
   
   
   ### Why are the changes needed?
   It is possible to set the metadata pointers of table A to point to table B, 
and therefore you could read table B's data via querying table A.
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   
   ### How was this patch tested?
   Manual test, unit test
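   Conceptually, the change amounts to something like the sketch below 
(assumed shape and names, not the actual patch):

```java
// Derive the authz URI from the table's metadata pointer instead of its
// db/table name, so pointing table A's metadata_location at table B's
// metadata requires authorization on B's files. A real implementation must
// also escape the nested scheme of the metadata path.
static String authzUri(java.util.Map<String, String> tblProperties) {
  String metadataLocation = tblProperties.get("metadata_location");
  return "iceberg://" + metadataLocation;  // was: "iceberg://" + dbName + "." + tableName
}
```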
   
   




Issue Time Tracking
---

Worklog Id: (was: 759231)
Remaining Estimate: 0h
Time Spent: 10m

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data via querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location

2022-04-20 Thread tanghui (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524983#comment-17524983
 ] 

tanghui edited comment on HIVE-24920 at 4/20/22 1:18 PM:
-

After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.



set hive.create.as.external.legacy=true;

CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

-- this returns no data for the queried partition.
select * from part_test11 where dat="20220101";
-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.

 


was (Author: sanguines):
After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.


CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

-- this returns no data for the queried partition.
select * from part_test11 where dat="20220101";
-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.

 

> TRANSLATED_TO_EXTERNAL tables may write to the same location
> 
>
> Key: HIVE-24920
> URL: https://issues.apache.org/jira/browse/HIVE-24920
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: metastore_translator, pull-request-available
> Fix For: 4.0.0, 4.0.0-alpha-1
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> {code}
> create table t (a integer);
> insert into t values(1);
> alter table t rename to t2;
> create table t (a integer); -- I expected an exception from this command 
> (location already exists) but because its an external table no exception
> insert into t values(2);
> select * from t;  -- shows 1 and 2
> drop table t2;-- wipes out data location
> select * from t;  -- empty resultset
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2022-04-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-26157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Pintér reassigned HIVE-26157:



> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data via querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26156?focusedWorklogId=759212=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759212
 ]

ASF GitHub Bot logged work on HIVE-26156:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 13:13
Start Date: 20/Apr/22 13:13
Worklog Time Spent: 10m 
  Work Description: szlta commented on code in PR #3225:
URL: https://github.com/apache/hive/pull/3225#discussion_r854074661


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergDeleteWriter.java:
##
@@ -53,7 +54,8 @@ public class HiveIcebergDeleteWriter extends 
HiveIcebergWriter {
   public void write(Writable row) throws IOException {
 Record rec = ((Container<Record>) row).get();
 PositionDelete<Record> positionDelete = IcebergAcidUtil.getPositionDelete(rec, rowDataTemplate);
-writer.write(positionDelete, spec, partition(positionDelete.row()));
+Integer specId = rec.get(0, Integer.class);

Review Comment:
   I guess this is the meta column at position 0? Could you annotate it with 
some comments please?



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java:
##
@@ -374,9 +373,12 @@ public DynamicPartitionCtx createDPContext(HiveConf 
hiveConf, org.apache.hadoop.
   fieldOrderMap.put(fields.get(i).name(), i);
 }
 
+// deletes already use the bucket values in the partition_struct for 
sorting, so no need to add the sort expression

Review Comment:
   Is this true for bucket transform only? Is bucket() special for some reason, 
or could we avoid sorting according to other partition columns (and transform 
types) as well?



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergWriter.java:
##
@@ -101,7 +101,14 @@ public void close(boolean abort) throws IOException {
   }
 
   protected PartitionKey partition(Record row) {
-currentKey.partition(wrapper.wrap(row));
-return currentKey;
+// get partition key for the latest spec
+return partition(row, specs.size() - 1);

Review Comment:
   Same question for getting the latest spec.



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergRecordWriter.java:
##
@@ -37,17 +38,17 @@
 
 class HiveIcebergRecordWriter extends HiveIcebergWriter {
 
-  HiveIcebergRecordWriter(Schema schema, PartitionSpec spec, FileFormat format,
+  HiveIcebergRecordWriter(Schema schema, Map<Integer, PartitionSpec> specs, FileFormat format,
   FileWriterFactory<Record> fileWriterFactory, OutputFileFactory fileFactory, FileIO io, long targetFileSize,
   TaskAttemptID taskAttemptID, String tableName) {
-super(schema, spec, io, taskAttemptID, tableName,
+super(schema, specs, io, taskAttemptID, tableName,
 new ClusteredDataWriter<>(fileWriterFactory, fileFactory, io, format, 
targetFileSize));
   }
 
   @Override
   public void write(Writable row) throws IOException {
 Record record = ((Container<Record>) row).get();
-writer.write(record, spec, partition(record));
+writer.write(record, specs.get(specs.size() - 1), partition(record));

Review Comment:
   Are we trying to get the latest spec here? If so it could become problematic 
if an older spec is reused to be the current one.
   E.g.: partition evolution goes as:
   
   1. initially partitioned by col and col2 -> spec0: identity(col), 
identity(col2); latest_spec=0
   2. remove col from spec -> spec1: identity(col2); latest_spec=1
   3. re-add col to spec -> no new spec is created; latest_spec=0
   





Issue Time Tracking
---

Worklog Id: (was: 759212)
Remaining Estimate: 0h
Time Spent: 10m

> Iceberg delete writer should handle deleting from old partition specs
> -
>
> Key: HIVE-26156
> URL: https://issues.apache.org/jira/browse/HIVE-26156
> Project: Hive
>  Issue Type: Bug
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While {{HiveIcebergRecordWriter}} always writes data out according to the 
> latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files 
> into partitions that correspond to a variety of specs, both old and new. 
> Therefore we should pass the {{table.specs()}} map into the 
> {{HiveIcebergWriter}} so that the delete writer can choose the appropriate 
> spec on a per-record basis.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-24920) TRANSLATED_TO_EXTERNAL tables may write to the same location

2022-04-20 Thread tanghui (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524983#comment-17524983
 ] 

tanghui commented on HIVE-24920:


After the patch is applied, the partitioned table's location and its HDFS data 
directory are displayed correctly, but the partition locations recorded in the 
SDS table of the Hive metastore database still point to the old table's 
location, so queries against those partitions return no data.


CREATE TABLE part_test(
c1 string
,c2 string
)PARTITIONED BY (dat string)

insert into part_test values ("11","th","20220101")
insert into part_test values ("22","th","20220102")

alter table part_test rename to part_test11;

-- this returns no data for the queried partition.
select * from part_test11 where dat="20220101";
-

SDS in the Hive metastore database:
select SDS.LOCATION from TBLS,SDS where TBLS.TBL_NAME="part_test11" AND 
TBLS.TBL_ID=SDS.CD_ID;

---
|LOCATION|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test11|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220101|
|hdfs://nameservice1/warehouse/tablespace/external/hive/part_test/dat=20220102|

---

 

We need to modify the partition locations of the table in SDS so that 
partition queries return the correct results.

 

> TRANSLATED_TO_EXTERNAL tables may write to the same location
> 
>
> Key: HIVE-24920
> URL: https://issues.apache.org/jira/browse/HIVE-24920
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: metastore_translator, pull-request-available
> Fix For: 4.0.0, 4.0.0-alpha-1
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> {code}
> create table t (a integer);
> insert into t values(1);
> alter table t rename to t2;
> create table t (a integer); -- I expected an exception from this command 
> (location already exists) but because its an external table no exception
> insert into t values(2);
> select * from t;  -- shows 1 and 2
> drop table t2;-- wipes out data location
> select * from t;  -- empty resultset
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-26156:
--
Labels: pull-request-available  (was: )

> Iceberg delete writer should handle deleting from old partition specs
> -
>
> Key: HIVE-26156
> URL: https://issues.apache.org/jira/browse/HIVE-26156
> Project: Hive
>  Issue Type: Bug
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While {{HiveIcebergRecordWriter}} always writes data out according to the 
> latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files 
> into partitions that correspond to a variety of specs, both old and new. 
> Therefore we should pass the {{table.specs()}} map into the 
> {{HiveIcebergWriter}} so that the delete writer can choose the appropriate 
> spec on a per-record basis.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HIVE-26156) Iceberg delete writer should handle deleting from old partition specs

2022-04-20 Thread Marton Bod (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marton Bod reassigned HIVE-26156:
-


> Iceberg delete writer should handle deleting from old partition specs
> -
>
> Key: HIVE-26156
> URL: https://issues.apache.org/jira/browse/HIVE-26156
> Project: Hive
>  Issue Type: Bug
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>
> While {{HiveIcebergRecordWriter}} always writes data out according to the 
> latest spec, the {{HiveIcebergDeleteWriter}} might have to write delete files 
> into partitions that correspond to a variety of specs, both old and new. 
> Therefore we should pass the {{table.specs()}} map into the 
> {{HiveIcebergWriter}} so that the delete writer can choose the appropriate 
> spec on a per-record basis.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759172=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759172
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 12:34
Start Date: 20/Apr/22 12:34
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r854080693


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##
@@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, 
Table table) {
   public static boolean isBucketed(Table table) {
 return table.spec().fields().stream().anyMatch(f -> 
f.transform().toString().startsWith("bucket["));
   }
+
+  /**
+   * Returns the snapshot ID which is immediately before (or exactly at) the 
timestamp provided in millis.
+   * If the timestamp provided is before the first snapshot of the table, we 
return an empty optional.
+   * If the timestamp provided is in the future compared to the latest 
snapshot, we return the latest snapshot ID.
+   *
+   * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 
respectively (T0 = start of epoch), then:
+   * - from T0 to T2 -> returns empty
+   * - from T3 to T5 -> returns S1
+   * - from T6 to T8 -> returns S2
+   * - from T9 to T∞ -> returns S3
+   *
+   * @param table the table whose snapshot ID we are trying to find
+   * @param time the timestamp provided in milliseconds
+   * @return the snapshot ID corresponding to the time
+   */
+  public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
+if (table.history().get(0).timestampMillis() > time) {
+  return Optional.empty();
+}
+
+for (Snapshot snapshot : table.snapshots()) {

Review Comment:
   Thanks for the explanation!
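   For readers following along: the loop body is cut off in the quoted diff, 
so here is a hedged reconstruction that follows the javadoc semantics above 
and relies on the commit-time ordering discussed in this thread (not 
necessarily the exact PR code):

```java
import java.util.Optional;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

class SnapshotLookup {
  // Returns the ID of the last snapshot committed at or before 'time', or
  // empty if 'time' precedes the first snapshot. Since table.snapshots()
  // iterates in commit order, the last match wins.
  static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
    if (table.history().get(0).timestampMillis() > time) {
      return Optional.empty();
    }
    Long snapshotId = null;
    for (Snapshot snapshot : table.snapshots()) {
      if (snapshot.timestampMillis() <= time) {
        snapshotId = snapshot.snapshotId();
      }
    }
    return Optional.ofNullable(snapshotId);
  }
}
```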





Issue Time Tracking
---

Worklog Id: (was: 759172)
Time Spent: 1.5h  (was: 1h 20m)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<timestamp 1>' TO '<timestamp 2>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <version 1> TO <version 2>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759152=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759152
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 12:04
Start Date: 20/Apr/22 12:04
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r854053748


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##
@@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, 
Table table) {
   public static boolean isBucketed(Table table) {
 return table.spec().fields().stream().anyMatch(f -> 
f.transform().toString().startsWith("bucket["));
   }
+
+  /**
+   * Returns the snapshot ID which is immediately before (or exactly at) the 
timestamp provided in millis.
+   * If the timestamp provided is before the first snapshot of the table, we 
return an empty optional.
+   * If the timestamp provided is in the future compared to the latest 
snapshot, we return the latest snapshot ID.
+   *
+   * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 
respectively (T0 = start of epoch), then:
+   * - from T0 to T2 -> returns empty
+   * - from T3 to T5 -> returns S1
+   * - from T6 to T8 -> returns S2
+   * - from T9 to T∞ -> returns S3
+   *
+   * @param table the table whose snapshot ID we are trying to find
+   * @param time the timestamp provided in milliseconds
+   * @return the snapshot ID corresponding to the time
+   */
+  public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
+if (table.history().get(0).timestampMillis() > time) {
+  return Optional.empty();
+}
+
+for (Snapshot snapshot : table.snapshots()) {

Review Comment:
   Looks like the snapshots are ordered by commit time. 
   Whenever there's a commit, we take the existing list of the snapshots in the 
`TableMetadata.Builder` 
[here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L817),
 and simply append the new snapshot to the end 
[here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L994).
 
   And since it's a List, the iteration order will be deterministic.





Issue Time Tracking
---

Worklog Id: (was: 759152)
Time Spent: 1h 20m  (was: 1h 10m)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<timestamp 1>' TO '<timestamp 2>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <version 1> TO <version 2>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759150=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759150
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 12:03
Start Date: 20/Apr/22 12:03
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r854053748


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##
@@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, 
Table table) {
   public static boolean isBucketed(Table table) {
 return table.spec().fields().stream().anyMatch(f -> 
f.transform().toString().startsWith("bucket["));
   }
+
+  /**
+   * Returns the snapshot ID which is immediately before (or exactly at) the 
timestamp provided in millis.
+   * If the timestamp provided is before the first snapshot of the table, we 
return an empty optional.
+   * If the timestamp provided is in the future compared to the latest 
snapshot, we return the latest snapshot ID.
+   *
+   * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 
respectively (T0 = start of epoch), then:
+   * - from T0 to T2 -> returns empty
+   * - from T3 to T5 -> returns S1
+   * - from T6 to T8 -> returns S2
+   * - from T9 to T∞ -> returns S3
+   *
+   * @param table the table whose snapshot ID we are trying to find
+   * @param time the timestamp provided in milliseconds
+   * @return the snapshot ID corresponding to the time
+   */
+  public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
+if (table.history().get(0).timestampMillis() > time) {
+  return Optional.empty();
+}
+
+for (Snapshot snapshot : table.snapshots()) {

Review Comment:
   Looks like the snapshots are ordered by commit time. 
   Whenever there's a commit, we take the existing list of the snapshots 
[here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L817),
 and simply append the new snapshot to the end 
[here](https://github.com/apache/iceberg/blob/9618147b6de8f8627052a205b86e45263394b0c2/core/src/main/java/org/apache/iceberg/TableMetadata.java#L994).
 
   And since it's a List, the iteration order will be deterministic.





Issue Time Tracking
---

Worklog Id: (was: 759150)
Time Spent: 1h 10m  (was: 1h)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<timestamp 1>' TO '<timestamp 2>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <version 1> TO <version 2>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26009) Determine number of buckets for implicitly bucketed ACIDv2 tables

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26009?focusedWorklogId=759108=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759108
 ]

ASF GitHub Bot logged work on HIVE-26009:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 10:38
Start Date: 20/Apr/22 10:38
Worklog Time Spent: 10m 
  Work Description: simhadri-g opened a new pull request, #3224:
URL: https://github.com/apache/hive/pull/3224

   
   
   
   
   ### What changes were proposed in this pull request?
   
   
   The change prevents the reducer that writes the ORC files from running with 
parallelism 1 when the tables are implicitly bucketed.
   [HIVE-26009](https://issues.apache.org/jira/browse/HIVE-26009) and 
[HIVE-25611](https://issues.apache.org/jira/browse/HIVE-25611)
   
   
   ### Why are the changes needed?
   
   The numberOfBuckets for implicitly bucketed tables is set to -1 by default. 
When this is the case, it is left to Hive to estimate the number of reducers, 
and that estimate is not optimal in all cases. For example, picking a large 
value for hive.exec.reducers.bytes.per.reducer before running a MERGE query 
forces the reducer that writes the ORC files to run with parallelism 1. This 
simulates a scenario where the table has many buckets but the choice of 
parallelism does not take them into account, which can lead to a significant 
performance bottleneck.
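   In pseudo-form, the intended reducer-count decision looks roughly like this 
(helper shape assumed, not Hive's actual code):

```java
// Explicitly bucketed tables get one reducer per bucket; implicitly bucketed
// tables (numBuckets == -1) fall back to the size-based estimate, which can
// collapse to 1 when bytesPerReducer is set very high.
static int chooseReducers(int numBuckets, long inputBytes, long bytesPerReducer, int maxReducers) {
  if (numBuckets > 0) {
    return numBuckets;
  }
  long estimate = (inputBytes + bytesPerReducer - 1) / bytesPerReducer;  // ceiling division
  return (int) Math.max(1, Math.min(maxReducers, estimate));
}
```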
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Qtests and manual tests.




Issue Time Tracking
---

Worklog Id: (was: 759108)
Remaining Estimate: 0h
Time Spent: 10m

> Determine number of buckets for implicitly bucketed ACIDv2 tables 
> --
>
> Key: HIVE-26009
> URL: https://issues.apache.org/jira/browse/HIVE-26009
> Project: Hive
>  Issue Type: Improvement
>Reporter: Simhadri G
>Assignee: Simhadri G
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive tries to set the number of reducers equal to the number of buckets here: 
> [https://github.com/apache/hive/blob/9857c4e584384f7b0a49c34bc2bdf876c2ea1503/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L6958]
>  
>  
> The numberOfBuckets for implicitly bucketed tables is set to -1 by default. 
> When this is the case, it is left to Hive to estimate the number of reducers 
> required for the job, based on the job input and configuration parameters.
> [https://github.com/apache/hive/blob/9857c4e584384f7b0a49c34bc2bdf876c2ea1503/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L3369]
>  
> This estimate is not optimal in all cases. In the worst case, it can result 
> in a single reducer being launched, which can lead to a significant 
> performance bottleneck.
>  
> Ideally, the number of reducers launched should equal the number of buckets, 
> which is the case for explicitly bucketed tables.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HIVE-26009) Determine number of buckets for implicitly bucketed ACIDv2 tables

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-26009:
--
Labels: pull-request-available  (was: )

> Determine number of buckets for implicitly bucketed ACIDv2 tables 
> --
>
> Key: HIVE-26009
> URL: https://issues.apache.org/jira/browse/HIVE-26009
> Project: Hive
>  Issue Type: Improvement
>Reporter: Simhadri G
>Assignee: Simhadri G
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive tries to set the number of reducers equal to the number of buckets here: 
> [https://github.com/apache/hive/blob/9857c4e584384f7b0a49c34bc2bdf876c2ea1503/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L6958]
>  
>  
> The numberOfBuckets for implicitly bucketed tables is set to -1 by default. 
> When this is the case, it is left to Hive to estimate the number of reducers 
> required for the job, based on the job input and configuration parameters.
> [https://github.com/apache/hive/blob/9857c4e584384f7b0a49c34bc2bdf876c2ea1503/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L3369]
>  
> This estimate is not optimal in all cases. In the worst case, it can result 
> in a single reducer being launched, which can lead to a significant 
> performance bottleneck.
>  
> Ideally, the number of reducers launched should equal the number of buckets, 
> which is the case for explicitly bucketed tables.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759068=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759068
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 09:29
Start Date: 20/Apr/22 09:29
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r853926634


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##
@@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, 
Table table) {
   public static boolean isBucketed(Table table) {
 return table.spec().fields().stream().anyMatch(f -> 
f.transform().toString().startsWith("bucket["));
   }
+
+  /**
+   * Returns the snapshot ID which is immediately before (or exactly at) the 
timestamp provided in millis.
+   * If the timestamp provided is before the first snapshot of the table, we 
return an empty optional.
+   * If the timestamp provided is in the future compared to the latest 
snapshot, we return the latest snapshot ID.
+   *
+   * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 
respectively (T0 = start of epoch), then:
+   * - from T0 to T2 -> returns empty
+   * - from T3 to T5 -> returns S1
+   * - from T6 to T8 -> returns S2
+   * - from T9 to T∞ -> returns S3
+   *
+   * @param table the table whose snapshot ID we are trying to find
+   * @param time the timestamp provided in milliseconds
+   * @return the snapshot ID corresponding to the time
+   */
+  public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
+if (table.history().get(0).timestampMillis() > time) {
+  return Optional.empty();
+}
+
+for (Snapshot snapshot : table.snapshots()) {

Review Comment:
   Actually this is only true for V2 tables. Let me debug into it a bit to see 
what's happening for V1





Issue Time Tracking
---

Worklog Id: (was: 759068)
Time Spent: 1h  (was: 50m)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<timestamp 1>' TO '<timestamp 2>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <version 1> TO <version 2>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759067=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759067
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 09:27
Start Date: 20/Apr/22 09:27
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r853924951


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##
@@ -207,6 +218,39 @@ public RecordReader<Void, T> createRecordReader(InputSplit split, TaskAttemptContext context) {
 return new IcebergRecordReader<>();
   }
 
+  private static TableScan scanWithTimeRange(Table table, Configuration conf, 
TableScan scan, long fromTime) {
+// let's find the corresponding snapshot ID - if the fromTime is before 
the table creation happened, let's use
+// the first snapshot of the table
+long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, 
fromTime)
+.orElseGet(() -> table.history().get(0).snapshotId());
+if (fromSnapshot == table.currentSnapshot().snapshotId()) {
+  throw new IllegalArgumentException(
+  "Provided FROM timestamp must be earlier than the latest snapshot of 
the table.");
+}
+long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1);
+if (toTime != -1) {
+  if (fromTime >= toTime) {
+throw new IllegalArgumentException(
+"Provided FROM timestamp must precede the provided TO timestamp.");
+  }
+  long toSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, 
toTime)
+  .orElseThrow(() -> new IllegalArgumentException(
+  "Provided TO timestamp must be after the first snapshot of the 
table."));
+  return scan.appendsBetween(fromSnapshot, toSnapshot);
+} else {
+  return scan.appendsAfter(fromSnapshot);
+}
+  }
+
+  private static TableScan scanWithVersionRange(Configuration conf, TableScan 
scan, long fromSnapshot) {
+long toSnapshot = conf.getLong(InputFormatConfig.TO_VERSION, -1);

Review Comment:
   Sure
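   For context, the method is cut off at this line in the quoted diff; 
presumably the version branch mirrors the timestamp one. A hedged continuation:

```java
// Assumed continuation of scanWithVersionRange (mirroring scanWithTimeRange):
if (toSnapshot != -1) {
  return scan.appendsBetween(fromSnapshot, toSnapshot);  // appends in (from, to]
} else {
  return scan.appendsAfter(fromSnapshot);                // all appends after 'from'
}
```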





Issue Time Tracking
---

Worklog Id: (was: 759067)
Time Spent: 50m  (was: 40m)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<timestamp 1>' TO '<timestamp 2>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <version 1> TO <version 2>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759066=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759066
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 09:26
Start Date: 20/Apr/22 09:26
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r853924488


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##
@@ -207,6 +218,39 @@ public RecordReader<Void, T> createRecordReader(InputSplit split, TaskAttemptContext context) {
 return new IcebergRecordReader<>();
   }
 
+  private static TableScan scanWithTimeRange(Table table, Configuration conf, 
TableScan scan, long fromTime) {
+// let's find the corresponding snapshot ID - if the fromTime is before 
the table creation happened, let's use
+// the first snapshot of the table
+long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, 
fromTime)
+.orElseGet(() -> table.history().get(0).snapshotId());
+if (fromSnapshot == table.currentSnapshot().snapshotId()) {
+  throw new IllegalArgumentException(
+  "Provided FROM timestamp must be earlier than the latest snapshot of 
the table.");
+}
+long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1);

Review Comment:
   Sure



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##
@@ -207,6 +218,39 @@ public RecordReader<Void, T> createRecordReader(InputSplit split, TaskAttemptContext context) {
 return new IcebergRecordReader<>();
   }
 
+  private static TableScan scanWithTimeRange(Table table, Configuration conf, 
TableScan scan, long fromTime) {
+// let's find the corresponding snapshot ID - if the fromTime is before 
the table creation happened, let's use
+// the first snapshot of the table
+long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, 
fromTime)
+.orElseGet(() -> table.history().get(0).snapshotId());
+if (fromSnapshot == table.currentSnapshot().snapshotId()) {
+  throw new IllegalArgumentException(
+  "Provided FROM timestamp must be earlier than the latest snapshot of 
the table.");
+}
+long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1);
+if (toTime != -1) {
+  if (fromTime >= toTime) {

Review Comment:
   Yep, makes sense





Issue Time Tracking
---

Worklog Id: (was: 759066)
Time Spent: 40m  (was: 0.5h)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<timestamp 1>' TO '<timestamp 2>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <version 1> TO <version 2>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759063
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 09:21
Start Date: 20/Apr/22 09:21
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r853918915


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##
@@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) {
   public static boolean isBucketed(Table table) {
     return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket["));
   }
+
+  /**
+   * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis.
+   * If the timestamp provided is before the first snapshot of the table, we return an empty optional.
+   * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID.
+   *
+   * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then:
+   * - from T0 to T2 -> returns empty
+   * - from T3 to T5 -> returns S1
+   * - from T6 to T8 -> returns S2
+   * - from T9 to T∞ -> returns S3
+   *
+   * @param table the table whose snapshot ID we are trying to find
+   * @param time the timestamp provided in milliseconds
+   * @return the snapshot ID corresponding to the time
+   */
+  public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
+    if (table.history().get(0).timestampMillis() > time) {
+      return Optional.empty();
+    }
+
+    for (Snapshot snapshot : table.snapshots()) {

Review Comment:
   That's a good question. The snapshots come from `TableMetadata.snapshots()`, which returns a `List<Snapshot>`. The snapshots seem to be sorted by sequence number, which means the list is also sorted by snapshot time in millis:
   
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L982-L990
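
For reference, a minimal sketch of the lookup semantics the javadoc above describes, completing the loop body that the quoted hunk cuts off (an illustration based on the comments, not the exact patch):

```java
// Keep the last snapshot committed at or before `time`; empty if `time`
// predates the first commit. Relies on table.snapshots() being time-ordered,
// as discussed in the review thread above.
public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
  if (table.history().get(0).timestampMillis() > time) {
    return Optional.empty();
  }
  Long snapshotId = null;
  for (Snapshot snapshot : table.snapshots()) {
    if (snapshot.timestampMillis() <= time) {
      snapshotId = snapshot.snapshotId();
    } else {
      break;
    }
  }
  return Optional.ofNullable(snapshotId);
}
```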





Issue Time Tracking
---

Worklog Id: (was: 759063)
Time Spent: 0.5h  (was: 20m)

> Support range-based time travel queries for Iceberg
> ---
>
> Key: HIVE-26151
> URL: https://issues.apache.org/jira/browse/HIVE-26151
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Allow querying which records have been inserted during a certain time window 
> for Iceberg tables. The Iceberg TableScan API provides an implementation for 
> that, so most of the work would go into adding syntax support and 
> transporting the startTime and endTime parameters to the Iceberg input format.
> Proposed new syntax: 
> SELECT * FROM table FOR SYSTEM_TIME FROM '<start_timestamp>' TO '<end_timestamp>'
> SELECT * FROM table FOR SYSTEM_VERSION FROM <start_snapshot_id> TO <end_snapshot_id>
> (the TO clause is optional in both cases)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26151) Support range-based time travel queries for Iceberg

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26151?focusedWorklogId=759056&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759056
 ]

ASF GitHub Bot logged work on HIVE-26151:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 09:00
Start Date: 20/Apr/22 09:00
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on code in PR #3222:
URL: https://github.com/apache/hive/pull/3222#discussion_r853859878


##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##
@@ -207,6 +218,39 @@ public RecordReader<Void, T> createRecordReader(InputSplit split, TaskAttemptCon
     return new IcebergRecordReader<>();
   }
 
+  private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) {
+    // let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use
+    // the first snapshot of the table
+    long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime)
+        .orElseGet(() -> table.history().get(0).snapshotId());
+    if (fromSnapshot == table.currentSnapshot().snapshotId()) {
+      throw new IllegalArgumentException(
+          "Provided FROM timestamp must be earlier than the latest snapshot of the table.");
+    }
+    long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1);
+    if (toTime != -1) {
+      if (fromTime >= toTime) {

Review Comment:
   I think we can move this check to the beginning of the method, to spare some 
execution time.
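
   A sketch of the method with that suggestion applied, for illustration only (the TO-branch below is an assumption, since the quoted hunk cuts off before it; `appendsBetween`/`appendsAfter` are the Iceberg incremental-scan calls the range maps onto):

```java
// Hedged sketch, not the actual patch: validate the range up front,
// then resolve both endpoints to snapshot IDs.
private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) {
  long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1);
  if (toTime != -1 && fromTime >= toTime) {
    throw new IllegalArgumentException("FROM time must be earlier than TO time.");
  }
  // If fromTime predates the table, fall back to the first snapshot.
  long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime)
      .orElseGet(() -> table.history().get(0).snapshotId());
  if (fromSnapshot == table.currentSnapshot().snapshotId()) {
    throw new IllegalArgumentException("Provided FROM timestamp must be earlier than the latest snapshot of the table.");
  }
  if (toTime != -1) {
    // Empty-range edge cases are elided in this sketch.
    long toSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, toTime)
        .orElseGet(() -> table.history().get(0).snapshotId());
    return scan.appendsBetween(fromSnapshot, toSnapshot);
  }
  return scan.appendsAfter(fromSnapshot);
}
```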



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java:
##
@@ -163,4 +165,32 @@ public static void updateSpec(Configuration configuration, Table table) {
   public static boolean isBucketed(Table table) {
     return table.spec().fields().stream().anyMatch(f -> f.transform().toString().startsWith("bucket["));
   }
+
+  /**
+   * Returns the snapshot ID which is immediately before (or exactly at) the timestamp provided in millis.
+   * If the timestamp provided is before the first snapshot of the table, we return an empty optional.
+   * If the timestamp provided is in the future compared to the latest snapshot, we return the latest snapshot ID.
+   *
+   * E.g.: if we have snapshots S1, S2, S3 committed at times T3, T6, T9 respectively (T0 = start of epoch), then:
+   * - from T0 to T2 -> returns empty
+   * - from T3 to T5 -> returns S1
+   * - from T6 to T8 -> returns S2
+   * - from T9 to T∞ -> returns S3
+   *
+   * @param table the table whose snapshot ID we are trying to find
+   * @param time the timestamp provided in milliseconds
+   * @return the snapshot ID corresponding to the time
+   */
+  public static Optional<Long> findSnapshotForTimestamp(Table table, long time) {
+    if (table.history().get(0).timestampMillis() > time) {
+      return Optional.empty();
+    }
+
+    for (Snapshot snapshot : table.snapshots()) {

Review Comment:
   Are we certain that the table.snapshots() returns a list sorted by snapshot 
time?



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##
@@ -207,6 +218,39 @@ public RecordReader<Void, T> createRecordReader(InputSplit split, TaskAttemptCon
     return new IcebergRecordReader<>();
   }
 
+  private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) {
+    // let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use
+    // the first snapshot of the table
+    long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime)
+        .orElseGet(() -> table.history().get(0).snapshotId());
+    if (fromSnapshot == table.currentSnapshot().snapshotId()) {
+      throw new IllegalArgumentException(
+          "Provided FROM timestamp must be earlier than the latest snapshot of the table.");
+    }
+    long toTime = conf.getLong(InputFormatConfig.TO_TIMESTAMP, -1);

Review Comment:
   nit: Can we move the toTime to the method param? 



##
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##
@@ -207,6 +218,39 @@ public RecordReader<Void, T> createRecordReader(InputSplit split, TaskAttemptCon
     return new IcebergRecordReader<>();
  }
 
+  private static TableScan scanWithTimeRange(Table table, Configuration conf, TableScan scan, long fromTime) {
+    // let's find the corresponding snapshot ID - if the fromTime is before the table creation happened, let's use
+    // the first snapshot of the table
+    long fromSnapshot = IcebergTableUtil.findSnapshotForTimestamp(table, fromTime)
+        .orElseGet(() -> table.history().get(0).snapshotId());
+    if (fromSnapshot == table.currentSnapshot().snapshotId()) {
+      throw new IllegalArgumentException(

[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759044&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759044
 ]

ASF GitHub Bot logged work on HIVE-26074:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 08:31
Start Date: 20/Apr/22 08:31
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on code in PR #3187:
URL: https://github.com/apache/hive/pull/3187#discussion_r853870852


##
ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java:
##
@@ -768,6 +774,9 @@ public static SingleValueBoundaryScanner getBoundaryScanner(BoundaryDef start, B
       case "string":
         return new StringPrimitiveValueBoundaryScanner(start, end, exprDef, nullsLast);
       default:
+        if (typeString.startsWith("char") || typeString.startsWith("varchar")) {

Review Comment:
   Done. As discussed I pulled all of them together to avoid multiple branches.





Issue Time Tracking
---

Worklog Id: (was: 759044)
Time Spent: 1h 20m  (was: 1h 10m)

> PTF Vectorization: BoundaryScanner for varchar
> --
>
> Key: HIVE-26074
> URL: https://issues.apache.org/jira/browse/HIVE-26074
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on varchar type
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: 
> attempt to setup a Window for typeString: 'varchar(170)'
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.<init>(ValueBoundaryScanner.java:1257)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
>   ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759040&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759040
 ]

ASF GitHub Bot logged work on HIVE-26074:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 08:30
Start Date: 20/Apr/22 08:30
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on code in PR #3187:
URL: https://github.com/apache/hive/pull/3187#discussion_r853869839


##
ql/src/test/queries/clientpositive/vector_ptf_bounded_start.q:
##
@@ -3,24 +3,31 @@ set hive.vectorized.execution.enabled=true;
 set hive.vectorized.execution.ptf.enabled=true;
 set hive.fetch.task.conversion=none;
 
-CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double, rowindex int)
+CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double,
+p_type char(1), p_varchar varchar(5), rowindex int)
 ROW FORMAT DELIMITED
-FIELDS TERMINATED BY '\t'
+FIELDS TERMINATED BY ','
 STORED AS TEXTFILE;
 LOAD DATA LOCAL INPATH '../../data/files/vector_ptf_part_simple_all_datatypes.txt' OVERWRITE INTO TABLE vector_ptf_part_simple_text;
 
+SELECT * from vector_ptf_part_simple_text;
+
 CREATE TABLE vector_ptf_part_simple_orc (p_mfgr string, p_name string, p_date date, p_timestamp timestamp,
-p_int int, p_retailprice double, p_decimal decimal(10,4), rowindex int) stored as orc;
+p_int int, p_retailprice double, p_decimal decimal(10,4), p_type char(1), p_varchar varchar(5),rowindex int) stored

Review Comment:
   Changed.





Issue Time Tracking
---

Worklog Id: (was: 759040)
Time Spent: 1h  (was: 50m)

> PTF Vectorization: BoundaryScanner for varchar
> --
>
> Key: HIVE-26074
> URL: https://issues.apache.org/jira/browse/HIVE-26074
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on varchar type
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: 
> attempt to setup a Window for typeString: 'varchar(170)'
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.<init>(ValueBoundaryScanner.java:1257)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
>   ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759042&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759042
 ]

ASF GitHub Bot logged work on HIVE-26074:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 08:30
Start Date: 20/Apr/22 08:30
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on code in PR #3187:
URL: https://github.com/apache/hive/pull/3187#discussion_r853870053


##
ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java:
##
@@ -1214,6 +1223,55 @@ public boolean isEqualPrimitive(String s1, String s2) {
   }
 }
 
+class CharValueBoundaryScanner extends SingleValueBoundaryScanner {
+  public CharValueBoundaryScanner(BoundaryDef start, BoundaryDef end,
+      OrderExpressionDef expressionDef, boolean nullsLast) {
+    super(start, end, expressionDef, nullsLast);
+  }
+
+  @Override
+  public boolean isDistanceGreater(Object v1, Object v2, int amt) {
+    HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return s1 != null && s2 != null && s1.compareTo(s2) > 0;
+  }
+
+  @Override
+  public boolean isEqual(Object v1, Object v2) {
+    HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return (s1 == null && s2 == null) || (s1 != null && s1.equals(s2));
+  }
+}
+
+class VarcharValueBoundaryScanner extends SingleValueBoundaryScanner {
+  public VarcharValueBoundaryScanner(BoundaryDef start, BoundaryDef end,
+      OrderExpressionDef expressionDef, boolean nullsLast) {
+    super(start, end, expressionDef, nullsLast);
+  }
+
+  @Override
+  public boolean isDistanceGreater(Object v1, Object v2, int amt) {
+    HiveVarchar s1 = PrimitiveObjectInspectorUtils.getHiveVarchar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveVarchar s2 = PrimitiveObjectInspectorUtils.getHiveVarchar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return s1 != null && s2 != null && s1.compareTo(s2) > 0;
+  }
+
+  @Override
+  public boolean isEqual(Object v1, Object v2) {

Review Comment:
   Added
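
   For context, a self-contained sketch of the null-safety contract such a test would pin down (illustrative only; the real TestValueBoundaryScanner case needs the BoundaryDef/OrderExpressionDef setup, which is omitted here):

```java
import org.apache.hadoop.hive.common.type.HiveVarchar;

public class VarcharIsEqualSketch {
  // Mirrors the scanner's isEqual logic: both null => equal, one null => not equal.
  static boolean nullSafeEqual(HiveVarchar s1, HiveVarchar s2) {
    return (s1 == null && s2 == null) || (s1 != null && s1.equals(s2));
  }

  public static void main(String[] args) {
    assert nullSafeEqual(null, null);
    assert !nullSafeEqual(new HiveVarchar("a", 5), null);
    assert nullSafeEqual(new HiveVarchar("a", 5), new HiveVarchar("a", 5));
  }
}
```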





Issue Time Tracking
---

Worklog Id: (was: 759042)
Time Spent: 1h 10m  (was: 1h)

> PTF Vectorization: BoundaryScanner for varchar
> --
>
> Key: HIVE-26074
> URL: https://issues.apache.org/jira/browse/HIVE-26074
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> HIVE-24761 should be extended for varchar, otherwise it fails on varchar type
> {code}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Internal Error: 
> attempt to setup a Window for typeString: 'varchar(170)'
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.SingleValueBoundaryScanner.getBoundaryScanner(ValueBoundaryScanner.java:773)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner$MultiPrimitiveValueBoundaryScanner.  (ValueBoundaryScanner.java:1257)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.MultiValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:1237)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.ValueBoundaryScanner.getScanner(ValueBoundaryScanner.java:327)
>   at 
> org.apache.hadoop.hive.ql.udf.ptf.PTFRangeUtil.getRange(PTFRangeUtil.java:40)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFGroupBatches.finishPartition(VectorPTFGroupBatches.java:442)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.finishPartition(VectorPTFOperator.java:631)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.ptf.VectorPTFOperator.closeOp(VectorPTFOperator.java:782)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:731)
>   at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:755)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:383)
>   ... 16 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work logged] (HIVE-26074) PTF Vectorization: BoundaryScanner for varchar

2022-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26074?focusedWorklogId=759009&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759009
 ]

ASF GitHub Bot logged work on HIVE-26074:
-

Author: ASF GitHub Bot
Created on: 20/Apr/22 07:35
Start Date: 20/Apr/22 07:35
Worklog Time Spent: 10m 
  Work Description: abstractdog commented on code in PR #3187:
URL: https://github.com/apache/hive/pull/3187#discussion_r845932714


##
ql/src/test/queries/clientpositive/vector_ptf_bounded_start.q:
##
@@ -3,24 +3,31 @@ set hive.vectorized.execution.enabled=true;
 set hive.vectorized.execution.ptf.enabled=true;
 set hive.fetch.task.conversion=none;
 
-CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double, rowindex int)
+CREATE TABLE vector_ptf_part_simple_text(p_mfgr string, p_name string, p_date date, p_retailprice double,
+p_type char(1), p_varchar varchar(5), rowindex int)
 ROW FORMAT DELIMITED
-FIELDS TERMINATED BY '\t'
+FIELDS TERMINATED BY ','
 STORED AS TEXTFILE;
 LOAD DATA LOCAL INPATH '../../data/files/vector_ptf_part_simple_all_datatypes.txt' OVERWRITE INTO TABLE vector_ptf_part_simple_text;
 
+SELECT * from vector_ptf_part_simple_text;
+
 CREATE TABLE vector_ptf_part_simple_orc (p_mfgr string, p_name string, p_date date, p_timestamp timestamp,
-p_int int, p_retailprice double, p_decimal decimal(10,4), rowindex int) stored as orc;
+p_int int, p_retailprice double, p_decimal decimal(10,4), p_type char(1), p_varchar varchar(5),rowindex int) stored

Review Comment:
   let this be p_char instead of p_type



##
ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java:
##
@@ -1214,6 +1223,55 @@ public boolean isEqualPrimitive(String s1, String s2) {
   }
 }
 
+class CharValueBoundaryScanner extends SingleValueBoundaryScanner {
+  public CharValueBoundaryScanner(BoundaryDef start, BoundaryDef end,
+      OrderExpressionDef expressionDef, boolean nullsLast) {
+    super(start, end, expressionDef, nullsLast);
+  }
+
+  @Override
+  public boolean isDistanceGreater(Object v1, Object v2, int amt) {
+    HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return s1 != null && s2 != null && s1.compareTo(s2) > 0;
+  }
+
+  @Override
+  public boolean isEqual(Object v1, Object v2) {
+    HiveChar s1 = PrimitiveObjectInspectorUtils.getHiveChar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveChar s2 = PrimitiveObjectInspectorUtils.getHiveChar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return (s1 == null && s2 == null) || (s1 != null && s1.equals(s2));
+  }
+}
+
+class VarcharValueBoundaryScanner extends SingleValueBoundaryScanner {
+  public VarcharValueBoundaryScanner(BoundaryDef start, BoundaryDef end,
+      OrderExpressionDef expressionDef, boolean nullsLast) {
+    super(start, end, expressionDef, nullsLast);
+  }
+
+  @Override
+  public boolean isDistanceGreater(Object v1, Object v2, int amt) {
+    HiveVarchar s1 = PrimitiveObjectInspectorUtils.getHiveVarchar(v1,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    HiveVarchar s2 = PrimitiveObjectInspectorUtils.getHiveVarchar(v2,
+        (PrimitiveObjectInspector) expressionDef.getOI());
+    return s1 != null && s2 != null && s1.compareTo(s2) > 0;
+  }
+
+  @Override
+  public boolean isEqual(Object v1, Object v2) {

Review Comment:
   can you please add isEqual testcase to TestValueBoundaryScanner?



##
ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/ValueBoundaryScanner.java:
##
@@ -768,6 +774,9 @@ public static SingleValueBoundaryScanner getBoundaryScanner(BoundaryDef start, B
       case "string":
         return new StringPrimitiveValueBoundaryScanner(start, end, exprDef, nullsLast);
       default:
+        if (typeString.startsWith("char") || typeString.startsWith("varchar")) {

Review Comment:
   the same is handled for decimal above:
   ```
   if (typeString.startsWith("decimal")){
 typeString = "decimal"; //DecimalTypeInfo.getTypeName() includes 
scale/precision: "decimal(10,4)"
   }
   ```
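
   Taken together, the combined handling could normalize all three parameterized type names before the switch, e.g. (a sketch of the approach discussed here, not the merged patch):

```java
// Strip length/precision so the switch can match on the base type name:
// "decimal(10,4)" -> "decimal", "char(1)" -> "char", "varchar(170)" -> "varchar"
if (typeString.startsWith("decimal")) {
  typeString = "decimal";
} else if (typeString.startsWith("varchar")) {
  typeString = "varchar";
} else if (typeString.startsWith("char")) {
  typeString = "char";
}
```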





Issue Time Tracking
---

Worklog Id: (was: 759009)
Time Spent: 50m  (was: 40m)

> PTF Vectorization: BoundaryScanner for varchar
> --
>
> Key: HIVE-26074
> URL: https://issues.apache.org/jira/browse/HIVE-26074
> Project: Hive
>  Issue Type: Bug
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
>  Time 

[jira] [Updated] (HIVE-26146) Handle missing hive.acid.key.index in the fixacidkeyindex utility

2022-04-20 Thread Alessandro Solimando (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Solimando updated HIVE-26146:

Description: 
There is a utility in hive which can validate/fix corrupted 
_hive.acid.key.index_: 

{code:bash}
hive --service fixacidkeyindex $orcfilepath
{code}

At the moment the utility throws an NPE if the _hive.acid.key.index_ metadata 
entry is missing:

{noformat}
ERROR checking /hive-dev-box/multistripe_ko_acid.orc
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.validate(FixAcidKeyIndex.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFile(FixAcidKeyIndex.java:147)
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFiles(FixAcidKeyIndex.java:130)
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.main(FixAcidKeyIndex.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
{noformat}

The aim of this ticket is to handle this case so that the metadata entry can be 
re-generated even when it is missing.

  was:
There is a utility in hive which can validate/fix corrupted 
_hive.acid.key.index_: 

{code:bash}
hive --service fixacidkeyindex
{code}

At the moment the utility throws a NPE if the _hive.acid.key.index_ metadata 
entry is missing:

{noformat}
ERROR checking /hive-dev-box/multistripe_ko_acid.orc
java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.validate(FixAcidKeyIndex.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFile(FixAcidKeyIndex.java:147)
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFiles(FixAcidKeyIndex.java:130)
at 
org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.main(FixAcidKeyIndex.java:106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
{noformat}

The aim of this ticket is to handle such case in order to support re-generating 
this metadata entry even when it is missing.
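
For illustration, a minimal sketch of the guard this ticket calls for, assuming the ORC Reader metadata API ({{hasMetadataValue}}/{{getMetadataValue}}); {{validateIndex}} is a hypothetical stand-in for the existing validation logic:

{code:java}
private static final String ACID_KEY_INDEX_NAME = "hive.acid.key.index";

static boolean isAcidKeyIndexValid(org.apache.orc.Reader reader) {
  // Missing entry: report the file as needing a fix so the index is
  // re-generated, instead of dereferencing null and throwing an NPE.
  if (!reader.hasMetadataValue(ACID_KEY_INDEX_NAME)) {
    return false;
  }
  return validateIndex(reader.getMetadataValue(ACID_KEY_INDEX_NAME)); // hypothetical helper
}
{code}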


> Handle missing hive.acid.key.index in the fixacidkeyindex utility
> -
>
> Key: HIVE-26146
> URL: https://issues.apache.org/jira/browse/HIVE-26146
> Project: Hive
>  Issue Type: Improvement
>  Components: ORC, Transactions
>Affects Versions: 4.0.0-alpha-2
>Reporter: Alessandro Solimando
>Assignee: Alessandro Solimando
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is a utility in hive which can validate/fix corrupted 
> _hive.acid.key.index_: 
> {code:bash}
> hive --service fixacidkeyindex $orcfilepath
> {code}
> At the moment the utility throws an NPE if the _hive.acid.key.index_ metadata 
> entry is missing:
> {noformat}
> ERROR checking /hive-dev-box/multistripe_ko_acid.orc
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.validate(FixAcidKeyIndex.java:183)
> at 
> org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFile(FixAcidKeyIndex.java:147)
> at 
> org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.checkFiles(FixAcidKeyIndex.java:130)
> at 
> org.apache.hadoop.hive.ql.io.orc.FixAcidKeyIndex.main(FixAcidKeyIndex.java:106)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:308)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:222)
> {noformat}
> The aim of this ticket is to handle this case so that the metadata entry can 
> be re-generated even when it is missing.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)