[jira] [Commented] (HIVE-28258) Use Iceberg semantics for Merge task

2024-05-22 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848458#comment-17848458
 ] 

Sourabh Badhya commented on HIVE-28258:
---

[~kkasa] , this task mainly reuses the existing Iceberg readers 
(IcebergRecordReader) rather than the plain file-format readers chosen 
according to the table's file format. This way we can rely on the existing 
Iceberg code for handling the different file formats (ORC, Parquet, Avro) and 
avoid writing custom implementations for each of them.

Additionally, this keeps the handling of the different schemas that Iceberg 
maintains (the data schema and the delete schema) inside Iceberg, without 
exposing them through public APIs.

Custom hacks that were used earlier, such as changing the file format of the 
merge task, are also removed.

The existing test iceberg_merge_files.q should serve as an example for 
debugging the merge task used for Iceberg.
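
For anyone debugging this locally, a minimal sketch of how to exercise the 
merge task on an Iceberg table (the merge settings are standard Hive configs; 
the table and source names are illustrative, see iceberg_merge_files.q for the 
real coverage):
{noformat}
-- Merge settings that make Hive append a file-merge task when a query writes
-- many small files (threshold values here are illustrative):
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
set hive.merge.size.per.task=256000000;

create table ice_merge_demo (id int, val string) stored by iceberg stored as orc;
-- A write that fans out into several small files can then be followed by a
-- merge task that goes through the Iceberg readers described above.
insert overwrite table ice_merge_demo
select id, val from some_source distribute by id;
{noformat}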

> Use Iceberg semantics for Merge task
> 
>
> Key: HIVE-28258
> URL: https://issues.apache.org/jira/browse/HIVE-28258
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> Use Iceberg semantics for Merge task, instead of normal ORC or parquet 
> readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28267) Support merge task functionality for Iceberg delete files

2024-05-17 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-28267:
-

 Summary: Support merge task functionality for Iceberg delete files
 Key: HIVE-28267
 URL: https://issues.apache.org/jira/browse/HIVE-28267
 Project: Hive
  Issue Type: Improvement
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Support merge task functionality for Iceberg delete files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28258) Use Iceberg semantics for Merge task

2024-05-14 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-28258:
-

 Summary: Use Iceberg semantics for Merge task
 Key: HIVE-28258
 URL: https://issues.apache.org/jira/browse/HIVE-28258
 Project: Hive
  Issue Type: Improvement
  Components: Iceberg integration
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Use Iceberg semantics for Merge task, instead of normal ORC or parquet readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28167) Full table deletion fails when converting to truncate for Iceberg and ACID tables

2024-04-30 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842311#comment-17842311
 ] 

Sourabh Badhya commented on HIVE-28167:
---

[~zabetak] , issuing a truncate command is far more efficient than performing a 
full-table delete operation, which would only create more files, so we would 
like this to be the default behaviour. The config is only deprecated (not 
removed) because it was already released in 4.0.0; we would like to deprecate 
it first and then remove it completely in the next release.
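
To make the behaviour concrete, a small sketch using the repro table from the 
quoted description below:
{noformat}
-- With the conversion in place, an unfiltered DELETE on the table is planned
-- as a truncate instead of rewriting or adding data files:
explain delete from ice01;   -- plan shows the truncate-style operation
delete from ice01;           -- effectively: truncate table ice01;
select count(*) from ice01;  -- 0
{noformat}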

> Full table deletion fails when converting to truncate for Iceberg and ACID 
> tables
> -
>
> Key: HIVE-28167
> URL: https://issues.apache.org/jira/browse/HIVE-28167
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> A simple repro - 
> {code:java}
> create table ice01 (id int, key int) stored by iceberg stored as orc 
> tblproperties ('format-version'='2', 'write.delete.mode'='copy-on-write');
> insert into ice01 values (1,1),(2,1),(3,1),(4,1);
> insert into ice01 values (1,2),(2,2),(3,2),(4,2);
> insert into ice01 values (1,3),(2,3),(3,3),(4,3);
> insert into ice01 values (1,4),(2,4),(3,4),(4,4);
> insert into ice01 values (1,5),(2,5),(3,5),(4,5);
> explain analyze delete from ice01;
> delete from ice01;
> select count(*) from ice01;
> select * from ice01;
> describe formatted ice01; {code}
> The solution is to convert full table deletion to a truncate operation on the 
> table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-28227) Change the description of HIVE_OPTIMIZE_METADATA_DELETE config

2024-04-30 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-28227.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~ayushtkn] for the reviews.

> Change the description of HIVE_OPTIMIZE_METADATA_DELETE config
> --
>
> Key: HIVE-28227
> URL: https://issues.apache.org/jira/browse/HIVE-28227
> Project: Hive
>  Issue Type: Task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> The current description is misleading. Change it to - 
> {code:java}
> "Optimize delete using filters provided by the query. This uses the metadata 
> of the table provided by table formats like Iceberg." {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28227) Change the description of HIVE_OPTIMIZE_METADATA_DELETE config

2024-04-28 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-28227:
-

 Summary: Change the description of HIVE_OPTIMIZE_METADATA_DELETE 
config
 Key: HIVE-28227
 URL: https://issues.apache.org/jira/browse/HIVE-28227
 Project: Hive
  Issue Type: Task
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


The current description is misleading. Change it to - 
{code:java}
"Optimize delete using filters provided by the query. This uses the metadata of 
the table provided by table formats like Iceberg." {code}
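
For reference, a usage sketch; the property key is assumed here to be 
hive.optimize.delete.metadata.only (check HiveConf for the exact name 
registered under HIVE_OPTIMIZE_METADATA_DELETE), and the table/column names 
are placeholders:
{noformat}
-- Assumed property key; verify against HiveConf before relying on it.
set hive.optimize.delete.metadata.only=true;
-- A delete whose predicate can be answered purely from Iceberg metadata
-- (e.g. it matches whole partitions) can then drop the matching data files
-- without reading or rewriting them:
delete from ice_events where event_day = '2024-01-01';
{noformat}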



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-28148) Implement array_compact UDF to remove all nulls from an array

2024-04-05 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-28148.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~tarak271] for the contribution.

> Implement array_compact UDF to remove all nulls from an array
> -
>
> Key: HIVE-28148
> URL: https://issues.apache.org/jira/browse/HIVE-28148
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Taraka Rama Rao Lethavadla
>Assignee: Taraka Rama Rao Lethavadla
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> *array_compact(array)*
> Removes NULL elements from {{array}}.
>  
> {noformat}
> SELECT array_compact(array(1, 2, NULL, 3, NULL, 3));
> => [1, 2, 3, 3]
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28167) Full table deletion fails when converting to truncate for Iceberg and ACID tables

2024-04-02 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-28167:
--
Summary: Full table deletion fails when converting to truncate for Iceberg 
and ACID tables  (was: HIVE-28167: Full table deletion fails when converting to 
truncate for Iceberg and ACID tables)

> Full table deletion fails when converting to truncate for Iceberg and ACID 
> tables
> -
>
> Key: HIVE-28167
> URL: https://issues.apache.org/jira/browse/HIVE-28167
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> A simple repro - 
> {code:java}
> create table ice01 (id int, key int) stored by iceberg stored as orc 
> tblproperties ('format-version'='2', 'write.delete.mode'='copy-on-write');
> insert into ice01 values (1,1),(2,1),(3,1),(4,1);
> insert into ice01 values (1,2),(2,2),(3,2),(4,2);
> insert into ice01 values (1,3),(2,3),(3,3),(4,3);
> insert into ice01 values (1,4),(2,4),(3,4),(4,4);
> insert into ice01 values (1,5),(2,5),(3,5),(4,5);
> explain analyze delete from ice01;
> delete from ice01;
> select count(*) from ice01;
> select * from ice01;
> describe formatted ice01; {code}
> The solution is to convert full table deletion to a truncate operation on the 
> table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-28167) HIVE-28167: Full table deletion fails when converting to truncate for Iceberg and ACID tables

2024-04-02 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-28167.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~dkuzmenko] for the reviews.

> HIVE-28167: Full table deletion fails when converting to truncate for Iceberg 
> and ACID tables
> -
>
> Key: HIVE-28167
> URL: https://issues.apache.org/jira/browse/HIVE-28167
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> A simple repro - 
> {code:java}
> create table ice01 (id int, key int) stored by iceberg stored as orc 
> tblproperties ('format-version'='2', 'write.delete.mode'='copy-on-write');
> insert into ice01 values (1,1),(2,1),(3,1),(4,1);
> insert into ice01 values (1,2),(2,2),(3,2),(4,2);
> insert into ice01 values (1,3),(2,3),(3,3),(4,3);
> insert into ice01 values (1,4),(2,4),(3,4),(4,4);
> insert into ice01 values (1,5),(2,5),(3,5),(4,5);
> explain analyze delete from ice01;
> delete from ice01;
> select count(*) from ice01;
> select * from ice01;
> describe formatted ice01; {code}
> The solution is to convert full table deletion to a truncate operation on the 
> table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28167) HIVE-28167: Full table deletion fails when converting to truncate for Iceberg and ACID tables

2024-04-02 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-28167:
--
Summary: HIVE-28167: Full table deletion fails when converting to truncate 
for Iceberg and ACID tables  (was: Iceberg: Full table deletion fails when 
using Copy-on-write)

> HIVE-28167: Full table deletion fails when converting to truncate for Iceberg 
> and ACID tables
> -
>
> Key: HIVE-28167
> URL: https://issues.apache.org/jira/browse/HIVE-28167
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> A simple repro - 
> {code:java}
> create table ice01 (id int, key int) stored by iceberg stored as orc 
> tblproperties ('format-version'='2', 'write.delete.mode'='copy-on-write');
> insert into ice01 values (1,1),(2,1),(3,1),(4,1);
> insert into ice01 values (1,2),(2,2),(3,2),(4,2);
> insert into ice01 values (1,3),(2,3),(3,3),(4,3);
> insert into ice01 values (1,4),(2,4),(3,4),(4,4);
> insert into ice01 values (1,5),(2,5),(3,5),(4,5);
> explain analyze delete from ice01;
> delete from ice01;
> select count(*) from ice01;
> select * from ice01;
> describe formatted ice01; {code}
> The solution is to convert full table deletion to a truncate operation on the 
> table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28167) Iceberg: Full table deletion fails when using Copy-on-write

2024-03-31 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-28167:
-

 Summary: Iceberg: Full table deletion fails when using 
Copy-on-write
 Key: HIVE-28167
 URL: https://issues.apache.org/jira/browse/HIVE-28167
 Project: Hive
  Issue Type: Bug
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


A simple repro - 
{code:java}
create table ice01 (id int, key int) stored by iceberg stored as orc 
tblproperties ('format-version'='2', 'write.delete.mode'='copy-on-write');

insert into ice01 values (1,1),(2,1),(3,1),(4,1);
insert into ice01 values (1,2),(2,2),(3,2),(4,2);
insert into ice01 values (1,3),(2,3),(3,3),(4,3);
insert into ice01 values (1,4),(2,4),(3,4),(4,4);
insert into ice01 values (1,5),(2,5),(3,5),(4,5);

explain analyze delete from ice01;

delete from ice01;

select count(*) from ice01;
select * from ice01;
describe formatted ice01; {code}
The solution is to convert full table deletion to a truncate operation on the 
table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28069) Iceberg: Implement Merge task functionality for Iceberg tables

2024-03-26 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830981#comment-17830981
 ] 

Sourabh Badhya commented on HIVE-28069:
---

Merged the addendum to master.
Thanks [~dkuzmenko] for the review.

> Iceberg: Implement Merge task functionality for Iceberg tables
> --
>
> Key: HIVE-28069
> URL: https://issues.apache.org/jira/browse/HIVE-28069
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Affects Versions: 4.0.0
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Implement Merge task functionality for Iceberg tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27671) Implement array_append UDF to append an element to array

2024-03-25 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27671.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~tarak271] for the contribution.

> Implement array_append UDF to append an element to array
> 
>
> Key: HIVE-27671
> URL: https://issues.apache.org/jira/browse/HIVE-27671
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Taraka Rama Rao Lethavadla
>Assignee: Taraka Rama Rao Lethavadla
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> *array_append(array, elem)*
> Returns the {{array}} with {{elem}} appended.
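
A usage sketch mirroring the array_compact example above (expected output 
shown, not taken from the issue):
{noformat}
SELECT array_append(array(1, 2, 3), 4);
=> [1, 2, 3, 4]
{noformat}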



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-28087) Iceberg: Timestamp partition columns with transforms are not correctly sorted during insert

2024-03-22 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-28087.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~ayushtkn] and [~simhadri-g] for the reviews.
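
For context, a minimal sketch of the statement shape this fixes (table and 
column names follow the quoted repro below; the partition spec is assumed to 
use Iceberg's month() transform on the timestamp column):
{noformat}
create table partition_transform_4 (t int, ts timestamp)
  partitioned by spec (month(ts))
  stored by iceberg;
-- Previously this insert needed an explicit "cluster by ts" to avoid the
-- ClusteredWriter error; with the fix the rows are sorted on the transformed
-- partition value before they reach the writer.
insert into table partition_transform_4 select t, ts from t1;
{noformat}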

> Iceberg: Timestamp partition columns with transforms are not correctly sorted 
> during insert
> ---
>
> Key: HIVE-28087
> URL: https://issues.apache.org/jira/browse/HIVE-28087
> Project: Hive
>  Issue Type: Task
>Reporter: Simhadri Govindappa
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
> Attachments: query-hive-377.csv
>
>
> Insert into partitioned table fails with the following error if the data is 
> not clustered.
> *Using cluster by clause it succeeds :* 
> {noformat}
> 0: jdbc:hive2://localhost:10001/> insert into table partition_transform_4 
> select t, ts from t1 cluster by ts;
> --
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> --
> Map 1 .. container SUCCEEDED  1  100  
>  0   0
> Reducer 2 .. container SUCCEEDED  1  100  
>  0   0
> --
> VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 9.47 s
> --
> INFO  : Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
> INFO  : Starting task [Stage-0:MOVE] in serial mode
> INFO  : Completed executing 
> command(queryId=root_20240222123244_0c448b32-4fd9-420d-be31-e39e2972af82); 
> Time taken: 10.534 seconds
> 100 rows affected (10.696 seconds){noformat}
>  
> *Without cluster By it fails:* 
> {noformat}
> 0: jdbc:hive2://localhost:10001/> insert into table partition_transform_4 
> select t, ts from t1;
> --
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> --
> Map 1 .. container SUCCEEDED  1  100  
>  0   0
> Reducer 2container   RUNNING  1  010  
>  2   0
> --
> VERTICES: 01/02  [=>>-] 50%   ELAPSED TIME: 9.53 s
> --
> Caused by: java.lang.IllegalStateException: Incoming records violate the 
> writer assumption that records are clustered by spec and by partition within 
> each spec. Either cluster the incoming records or switch to fanout writers.
> Encountered records that belong to already closed files:
> partition 'ts_month=2027-03' in spec [
>   1000: ts_month: month(2)
> ]
>   at org.apache.iceberg.io.ClusteredWriter.write(ClusteredWriter.java:96)
>   at 
> org.apache.iceberg.io.ClusteredDataWriter.write(ClusteredDataWriter.java:31)
>   at 
> org.apache.iceberg.mr.hive.writer.HiveIcebergRecordWriter.write(HiveIcebergRecordWriter.java:53)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:1181)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:111)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:502)
>   ... 20 more{noformat}
>  
>  
> A simple repro, using the attached csv file: 
> [^query-hive-377.csv]
> {noformat}
> create database t3;
> use t3;
> create table vector1k(
>         t int,
>         si int,
>         i int,
>         b bigint,
>         f float,
>         d double,
>         dc decimal(38,18),
>         bo boolean,
>         s string,
>         s2 string,
>         ts timestamp,
>         ts2 timestamp,
>         dt date)
>      row format delimited fields terminated by ',';
> load data local inpath "/query-hive-377.csv" OVERWRITE into table vector1k; 
> select * from vector1k; create table vectortab10k(
>         t int,
>      

[jira] [Assigned] (HIVE-28087) Iceberg: Timestamp partition columns with transforms are not correctly sorted during insert

2024-03-22 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya reassigned HIVE-28087:
-

Assignee: Sourabh Badhya  (was: Simhadri Govindappa)

> Iceberg: Timestamp partition columns with transforms are not correctly sorted 
> during insert
> ---
>
> Key: HIVE-28087
> URL: https://issues.apache.org/jira/browse/HIVE-28087
> Project: Hive
>  Issue Type: Task
>Reporter: Simhadri Govindappa
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Attachments: query-hive-377.csv
>
>
> Insert into partitioned table fails with the following error if the data is 
> not clustered.
> *Using cluster by clause it succeeds :* 
> {noformat}
> 0: jdbc:hive2://localhost:10001/> insert into table partition_transform_4 
> select t, ts from t1 cluster by ts;
> --
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> --
> Map 1 .. container SUCCEEDED  1  100  
>  0   0
> Reducer 2 .. container SUCCEEDED  1  100  
>  0   0
> --
> VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 9.47 s
> --
> INFO  : Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
> INFO  : Starting task [Stage-0:MOVE] in serial mode
> INFO  : Completed executing 
> command(queryId=root_20240222123244_0c448b32-4fd9-420d-be31-e39e2972af82); 
> Time taken: 10.534 seconds
> 100 rows affected (10.696 seconds){noformat}
>  
> *Without cluster By it fails:* 
> {noformat}
> 0: jdbc:hive2://localhost:10001/> insert into table partition_transform_4 
> select t, ts from t1;
> --
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> --
> Map 1 .. container SUCCEEDED  1  100  
>  0   0
> Reducer 2container   RUNNING  1  010  
>  2   0
> --
> VERTICES: 01/02  [=>>-] 50%   ELAPSED TIME: 9.53 s
> --
> Caused by: java.lang.IllegalStateException: Incoming records violate the 
> writer assumption that records are clustered by spec and by partition within 
> each spec. Either cluster the incoming records or switch to fanout writers.
> Encountered records that belong to already closed files:
> partition 'ts_month=2027-03' in spec [
>   1000: ts_month: month(2)
> ]
>   at org.apache.iceberg.io.ClusteredWriter.write(ClusteredWriter.java:96)
>   at 
> org.apache.iceberg.io.ClusteredDataWriter.write(ClusteredDataWriter.java:31)
>   at 
> org.apache.iceberg.mr.hive.writer.HiveIcebergRecordWriter.write(HiveIcebergRecordWriter.java:53)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:1181)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:111)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:502)
>   ... 20 more{noformat}
>  
>  
> A simple repro, using the attached csv file: 
> [^query-hive-377.csv]
> {noformat}
> create database t3;
> use t3;
> create table vector1k(
>         t int,
>         si int,
>         i int,
>         b bigint,
>         f float,
>         d double,
>         dc decimal(38,18),
>         bo boolean,
>         s string,
>         s2 string,
>         ts timestamp,
>         ts2 timestamp,
>         dt date)
>      row format delimited fields terminated by ',';
> load data local inpath "/query-hive-377.csv" OVERWRITE into table vector1k; 
> select * from vector1k; create table vectortab10k(
>         t int,
>         si int,
>         i int,
>         b bigint,
>         f float,
>         d double,
>    

[jira] [Resolved] (HIVE-28069) Iceberg: Implement Merge task functionality for Iceberg tables

2024-03-21 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-28069.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~dkuzmenko] and [~kkasa] for the reviews.

> Iceberg: Implement Merge task functionality for Iceberg tables
> --
>
> Key: HIVE-28069
> URL: https://issues.apache.org/jira/browse/HIVE-28069
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Affects Versions: 4.0.0
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Implement Merge task functionality for Iceberg tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28087) Iceberg: Timestamp partition columns with transforms are not correctly sorted during insert

2024-03-19 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-28087:
--
Summary: Iceberg: Timestamp partition columns with transforms are not 
correctly sorted during insert  (was: Hive Iceberg: Insert into partitioned 
table  fails if the data is not clustered)

> Iceberg: Timestamp partition columns with transforms are not correctly sorted 
> during insert
> ---
>
> Key: HIVE-28087
> URL: https://issues.apache.org/jira/browse/HIVE-28087
> Project: Hive
>  Issue Type: Task
>Reporter: Simhadri Govindappa
>Assignee: Simhadri Govindappa
>Priority: Major
> Attachments: query-hive-377.csv
>
>
> Insert into partitioned table fails with the following error if the data is 
> not clustered.
> *Using cluster by clause it succeeds :* 
> {noformat}
> 0: jdbc:hive2://localhost:10001/> insert into table partition_transform_4 
> select t, ts from t1 cluster by ts;
> --
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> --
> Map 1 .. container SUCCEEDED  1  100  
>  0   0
> Reducer 2 .. container SUCCEEDED  1  100  
>  0   0
> --
> VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 9.47 s
> --
> INFO  : Starting task [Stage-2:DEPENDENCY_COLLECTION] in serial mode
> INFO  : Starting task [Stage-0:MOVE] in serial mode
> INFO  : Completed executing 
> command(queryId=root_20240222123244_0c448b32-4fd9-420d-be31-e39e2972af82); 
> Time taken: 10.534 seconds
> 100 rows affected (10.696 seconds){noformat}
>  
> *Without cluster By it fails:* 
> {noformat}
> 0: jdbc:hive2://localhost:10001/> insert into table partition_transform_4 
> select t, ts from t1;
> --
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> --
> Map 1 .. container SUCCEEDED  1  100  
>  0   0
> Reducer 2container   RUNNING  1  010  
>  2   0
> --
> VERTICES: 01/02  [=>>-] 50%   ELAPSED TIME: 9.53 s
> --
> Caused by: java.lang.IllegalStateException: Incoming records violate the 
> writer assumption that records are clustered by spec and by partition within 
> each spec. Either cluster the incoming records or switch to fanout writers.
> Encountered records that belong to already closed files:
> partition 'ts_month=2027-03' in spec [
>   1000: ts_month: month(2)
> ]
>   at org.apache.iceberg.io.ClusteredWriter.write(ClusteredWriter.java:96)
>   at 
> org.apache.iceberg.io.ClusteredDataWriter.write(ClusteredDataWriter.java:31)
>   at 
> org.apache.iceberg.mr.hive.writer.HiveIcebergRecordWriter.write(HiveIcebergRecordWriter.java:53)
>   at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:1181)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:111)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>   at 
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:158)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.processVectorGroup(ReduceRecordSource.java:502)
>   ... 20 more{noformat}
>  
>  
> A simple repro, using the attached csv file: 
> [^query-hive-377.csv]
> {noformat}
> create database t3;
> use t3;
> create table vector1k(
>         t int,
>         si int,
>         i int,
>         b bigint,
>         f float,
>         d double,
>         dc decimal(38,18),
>         bo boolean,
>         s string,
>         s2 string,
>         ts timestamp,
>         ts2 timestamp,
>         dt date)
>      row format delimited fields terminated by ',';
> load data local inpath "/query-hive-377.csv" OVERWRITE into table vector1k; 
> select * from vector1k; create table vectortab10k(
>         t int,
>        

[jira] [Resolved] (HIVE-25972) HIVE_VECTORIZATION_USE_ROW_DESERIALIZE in hiveconf.java imply default value is false,in fact the default value is 'true'

2024-02-28 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-25972.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master.
Thanks [~kokila19] for the contribution.

> HIVE_VECTORIZATION_USE_ROW_DESERIALIZE in hiveconf.java imply default value 
> is false,in fact the default value is 'true'
> 
>
> Key: HIVE-25972
> URL: https://issues.apache.org/jira/browse/HIVE-25972
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Hive
>Affects Versions: 3.1.2, 4.0.0
>Reporter: lkl
>Assignee: Kokila N
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> HIVE_VECTORIZATION_USE_ROW_DESERIALIZE in HiveConf.java implies the default 
> value is false; in fact the default value is 'true'. The code is:
> {code:java}
> HIVE_VECTORIZATION_USE_ROW_DESERIALIZE("hive.vectorized.use.row.serde.deserialize",
>  true,
> "This flag should be set to true to enable vectorizing using row 
> deserialize.\n" +
> "The default value is false."), {code}
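
A quick way to see the mismatch from a Hive session (the property name and its 
actual default come from the snippet above):
{noformat}
-- "set <property>;" prints the effective value; despite the description text,
-- the registered default is true:
set hive.vectorized.use.row.serde.deserialize;
-- hive.vectorized.use.row.serde.deserialize=true
{noformat}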



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27950) STACK UDTF returns wrong results when # of argument is not a multiple of N

2024-02-19 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27950:
--
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Merged to master.
Thanks [~okumin] for the contribution and [~aturoczy] and 
[~InvisibleProgrammer] for the reviews.

> STACK UDTF returns wrong results when # of argument is not a multiple of N
> --
>
> Key: HIVE-27950
> URL: https://issues.apache.org/jira/browse/HIVE-27950
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> GenericUDTFStack nullifies the wrong cell when the number of values is not a 
> multiple of N. In the following case, the `col2` column of the last row should 
> be `NULL`, but `col1` is NULL instead. 
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> select stack(2, 'a', 'b', 'c', 
> 'd', 'e');
> +---+---+---+
> | col0  | col1  | col2  |
> +---+---+---+
> | a     | b     | c     |
> | d     | NULL  | c     |
> +---+---+---+{code}
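
For comparison, the output expected once the fix lands (a sketch, not captured 
from a run): the last value 'e' should land in col1 and col2 should be NULL.
{noformat}
SELECT stack(2, 'a', 'b', 'c', 'd', 'e');
+-------+-------+-------+
| col0  | col1  | col2  |
+-------+-------+-------+
| a     | b     | c     |
| d     | e     | NULL  |
+-------+-------+-------+
{noformat}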



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28069) Iceberg: Implement Merge task functionality for Iceberg tables

2024-02-08 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-28069:
-

 Summary: Iceberg: Implement Merge task functionality for Iceberg 
tables
 Key: HIVE-28069
 URL: https://issues.apache.org/jira/browse/HIVE-28069
 Project: Hive
  Issue Type: Improvement
  Components: Iceberg integration
Affects Versions: 4.0.0
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Implement Merge task functionality for Iceberg tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27637) Compare highest write ID of compaction records when trying to perform abort cleanup

2024-02-02 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813677#comment-17813677
 ] 

Sourabh Badhya commented on HIVE-27637:
---

The commit is reverted via https://github.com/apache/hive/pull/5058.

> Compare highest write ID of compaction records when trying to perform abort 
> cleanup
> ---
>
> Key: HIVE-27637
> URL: https://issues.apache.org/jira/browse/HIVE-27637
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Zsolt Miskolczi
>Assignee: Zsolt Miskolczi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Compare highest write ID of compaction records when trying to get the 
> potential table/partitions for abort cleanup.
> Idea: If there exists a highest write ID of a record in COMPACTION_QUEUE for 
> a table/partition which is greater than the max(aborted write ID) for that 
> table/partition, then we can potentially ignore abort cleanup for such 
> tables/partitions. This is because compaction will perform cleanup of 
> obsolete deltas and aborted deltas hence doing abort cleanup is redundant 
> here.
> This is more of an optimisation since it can potentially save some filesystem 
> operations (mainly file-listing during construction of Acid state).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HIVE-27637) Compare highest write ID of compaction records when trying to perform abort cleanup

2024-02-02 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya reopened HIVE-27637:
---

> Compare highest write ID of compaction records when trying to perform abort 
> cleanup
> ---
>
> Key: HIVE-27637
> URL: https://issues.apache.org/jira/browse/HIVE-27637
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Zsolt Miskolczi
>Assignee: Zsolt Miskolczi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Compare highest write ID of compaction records when trying to get the 
> potential table/partitions for abort cleanup.
> Idea: If there exists a highest write ID of a record in COMPACTION_QUEUE for 
> a table/partition which is greater than the max(aborted write ID) for that 
> table/partition, then we can potentially ignore abort cleanup for such 
> tables/partitions. This is because compaction will perform cleanup of 
> obsolete deltas and aborted deltas hence doing abort cleanup is redundant 
> here.
> This is more of an optimisation since it can potentially save some filesystem 
> operations (mainly file-listing during construction of Acid state).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27637) Compare highest write ID of compaction records when trying to perform abort cleanup

2024-02-02 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27637.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master.
Thanks [~InvisibleProgrammer] for the contribution and [~aturoczy] for the 
review.

> Compare highest write ID of compaction records when trying to perform abort 
> cleanup
> ---
>
> Key: HIVE-27637
> URL: https://issues.apache.org/jira/browse/HIVE-27637
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Zsolt Miskolczi
>Assignee: Zsolt Miskolczi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Compare highest write ID of compaction records when trying to get the 
> potential table/partitions for abort cleanup.
> Idea: If there exists a highest write ID of a record in COMPACTION_QUEUE for 
> a table/partition which is greater than the max(aborted write ID) for that 
> table/partition, then we can potentially ignore abort cleanup for such 
> tables/partitions. This is because compaction will perform cleanup of 
> obsolete deltas and aborted deltas hence doing abort cleanup is redundant 
> here.
> This is more of an optimisation since it can potentially save some filesystem 
> operations (mainly file-listing during construction of Acid state).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27938) Iceberg: Fix java.lang.ClassCastException during vectorized reads on partition columns

2024-02-01 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27938.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master.
Thanks [~simhadri-g] for the contribution.

> Iceberg: Fix java.lang.ClassCastException during vectorized reads on 
> partition columns 
> ---
>
> Key: HIVE-27938
> URL: https://issues.apache.org/jira/browse/HIVE-27938
> Project: Hive
>  Issue Type: Bug
>Reporter: Simhadri Govindappa
>Assignee: Simhadri Govindappa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> 1: jdbc:hive2://localhost:10001/> CREATE EXTERNAL TABLE ice3   (`col1` int, 
> `calday` date) PARTITIONED BY SPEC (calday)   stored by iceberg 
> tblproperties('format-version'='2'); 
> 1: jdbc:hive2://localhost:10001/>insert into ice3 values(1, '2020-11-20'); 
> 1: jdbc:hive2://localhost:10001/> select count(calday) from ice3;
> {code}
> Full stack trace: 
> {code:java}
> INFO  : Compiling 
> command(queryId=root_20231206184246_e8da1539-7537-45fe-af67-4c7ba219feab): 
> select count(calday) from ice3INFO  : No Stats for default@ice3, Columns: 
> caldayINFO  : Semantic Analysis Completed (retrial = false)INFO  : Created 
> Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, 
> comment:null)], properties:null)INFO  : Completed compiling 
> command(queryId=root_20231206184246_e8da1539-7537-45fe-af67-4c7ba219feab); 
> Time taken: 0.196 secondsINFO  : Operation QUERY obtained 0 locksINFO  : 
> Executing 
> command(queryId=root_20231206184246_e8da1539-7537-45fe-af67-4c7ba219feab): 
> select count(calday) from ice3INFO  : Query ID = 
> root_20231206184246_e8da1539-7537-45fe-af67-4c7ba219feabINFO  : Total jobs = 
> 1INFO  : Launching Job 1 out of 1INFO  : Starting task [Stage-1:MAPRED] in 
> serial modeINFO  : Subscribed to counters: [] for queryId: 
> root_20231206184246_e8da1539-7537-45fe-af67-4c7ba219feabINFO  : Session is 
> already openINFO  : Dag name: select count(calday) from ice3 (Stage-1)INFO  : 
> HS2 Host: [localhost], Query ID: 
> [root_20231206184246_e8da1539-7537-45fe-af67-4c7ba219feab], Dag ID: 
> [dag_1701888162260_0001_2], DAG Session ID: 
> [application_1701888162260_0001]INFO  : Status: Running (Executing on YARN 
> cluster with App id application_1701888162260_0001)
> --
>         VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  
> KILLED--Map
>  1            container       RUNNING      1          0        0        1     
>   4       0Reducer 2        container        INITED      1          0        
> 0        1       0       
> 0--VERTICES:
>  00/02  [>>--] 0%    ELAPSED TIME: 1.41 
> s--ERROR
>  : Status: FailedERROR : Vertex failed, vertexName=Map 1, 
> vertexId=vertex_1701888162260_0001_2_00, diagnostics=[Task failed, 
> taskId=task_1701888162260_0001_2_00_00, diagnostics=[TaskAttempt 0 
> failed, info=[Error: Error while running task ( failure ) : 
> attempt_1701888162260_0001_2_00_00_0:java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: 
> java.lang.ClassCastException: java.time.LocalDate cannot be cast to 
> org.apache.hadoop.hive.common.type.Dateat 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348)
>  at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276)   
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381)
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82)
>at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69)
>at java.security.AccessController.doPrivileged(Native Method)   at 
> javax.security.auth.Subject.doAs(Subject.java:422)   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)  
> at 
> 

[jira] [Resolved] (HIVE-28025) Fix flaky test iceberg_insert_overwrite_partition_transforms.q

2024-01-24 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-28025.
---
Fix Version/s: Not Applicable
   Resolution: Duplicate

> Fix flaky test iceberg_insert_overwrite_partition_transforms.q
> --
>
> Key: HIVE-28025
> URL: https://issues.apache.org/jira/browse/HIVE-28025
> Project: Hive
>  Issue Type: Test
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
> Fix For: Not Applicable
>
>
> The totalSize value in the describe table output is very close to 1000 and can 
> hover around it, which sometimes causes a whitespace character to go missing: 
> the number can be either a 3-digit or a 4-digit number. The rule masking 
> totalSize needs to be changed to avoid this flakiness.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28025) Fix flaky test iceberg_insert_overwrite_partition_transforms.q

2024-01-24 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-28025:
-

 Summary: Fix flaky test 
iceberg_insert_overwrite_partition_transforms.q
 Key: HIVE-28025
 URL: https://issues.apache.org/jira/browse/HIVE-28025
 Project: Hive
  Issue Type: Test
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


The totalSize value in the describe table output is very close to 1000 and can 
hover around it, which sometimes causes a whitespace character to go missing: 
the number can be either a 3-digit or a 4-digit number. The rule masking 
totalSize needs to be changed to avoid this flakiness.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27991) Utilise FanoutWriters when inserting records in an Iceberg table when the records are unsorted

2024-01-22 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809460#comment-17809460
 ] 

Sourabh Badhya commented on HIVE-27991:
---

Merged to master.
Thanks [~zhangbutao] for the review.

> Utilise FanoutWriters when inserting records in an Iceberg table when the 
> records are unsorted
> --
>
> Key: HIVE-27991
> URL: https://issues.apache.org/jira/browse/HIVE-27991
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> FanoutWriter is an Iceberg writer that keeps all file handles open until the 
> write is finished and is meant for incoming records that are unsorted. We can 
> by default switch to FanoutWriters instead of ClusteredWriters when no custom 
> sort expressions are present for the given table/query.
> A similar change is already implemented in Spark - 
> [https://github.com/apache/iceberg/pull/8621]
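
To illustrate the effect (table and source names are placeholders): an 
unsorted insert into a partitioned Iceberg table no longer has to be clustered 
by the partition key, because the fanout writer keeps one open file handle per 
partition.
{noformat}
create table ice_part (id int, ts timestamp)
  partitioned by spec (month(ts))
  stored by iceberg;
-- Rows arrive in arbitrary ts order; with fanout writers this succeeds without
-- an explicit "cluster by ts" and without the ClusteredWriter assertion error.
insert into ice_part select id, ts from src_events;
{noformat}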



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27991) Utilise FanoutWriters when inserting records in an Iceberg table when the records are unsorted

2024-01-22 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27991.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Utilise FanoutWriters when inserting records in an Iceberg table when the 
> records are unsorted
> --
>
> Key: HIVE-27991
> URL: https://issues.apache.org/jira/browse/HIVE-27991
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> FanoutWriter is an Iceberg writer that keeps all file handles open until the 
> write is finished and is meant for incoming records that are unsorted. We can 
> by default switch to FanoutWriters instead of ClusteredWriters when no custom 
> sort expressions are present for the given table/query.
> A similar change is already implemented in Spark - 
> [https://github.com/apache/iceberg/pull/8621]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27749) SchemaTool initSchema fails on Mariadb 10.2

2024-01-18 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27749.
---
Resolution: Fixed

Merged the addendum PR to master.
Thanks [~aturoczy] and [~dkuzmenko] for the reviews.

> SchemaTool initSchema fails on Mariadb 10.2
> ---
>
> Key: HIVE-27749
> URL: https://issues.apache.org/jira/browse/HIVE-27749
> Project: Hive
>  Issue Type: Bug
>  Components: Standalone Metastore
>Affects Versions: 4.0.0-alpha-2, 4.0.0-beta-1
>Reporter: Stamatis Zampetakis
>Assignee: Sourabh Badhya
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: mariadb-metastore-schema-tests.patch
>
>
> Schema initialization for 4.0.0-beta-1 fails when run on Mariadb 10.2.
> The problem is reproducible on current 
> (e5a7ce2f091da1f8a324da6e489cda59b9e4bfc6) master by applying the 
> [^mariadb-metastore-schema-tests.patch] and then running:
> {noformat}
> mvn test -Dtest=TestMariadb#install -Dtest.groups=""{noformat}
> The error is shown below:
> {noformat}
> 315/409  ALTER TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` 
> BIGINT(20) GENERATED ALWAYS AS (1) STORED NOT NULL;
> Error: (conn=11) You have an error in your SQL syntax; check the manual that 
> corresponds to your MariaDB server version for the right syntax to use near 
> 'NOT NULL' at line 1 (state=42000,code=1064)
> Aborting command set because "force" is false and command failed: "ALTER 
> TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` BIGINT(20) GENERATED 
> ALWAYS AS (1) STORED NOT NULL;"
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Schema 
> initialization FAILED! Metastore state would be inconsistent!
> Schema initialization FAILED! Metastore state would be inconsistent!
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Underlying 
> cause: java.io.IOException : Schema script failed, errorcode OTHER
> Underlying cause: java.io.IOException : Schema script failed, errorcode OTHER
> org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization 
> FAILED! Metastore state would be inconsistent!
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.SchemaToolTaskInit.execute(SchemaToolTaskInit.java:66)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:480)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:425)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.rules.DatabaseRule.installLatest(DatabaseRule.java:269)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.DbInstallBase.install(DbInstallBase.java:34)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> at 
> 

[jira] [Created] (HIVE-27991) Utilise FanoutWriters when inserting records in an Iceberg table when the records are unsorted

2024-01-09 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27991:
-

 Summary: Utilise FanoutWriters when inserting records in an 
Iceberg table when the records are unsorted
 Key: HIVE-27991
 URL: https://issues.apache.org/jira/browse/HIVE-27991
 Project: Hive
  Issue Type: Improvement
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


FanoutWriter is an Iceberg writer that keeps all file handles open until the 
write is finished and is meant for incoming records that are unsorted. We can 
by default switch to FanoutWriters instead of ClusteredWriters when no custom 
sort expressions are present for the given table/query.
A similar change is already implemented in Spark - 
[https://github.com/apache/iceberg/pull/8621]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27925) HiveConf: unify ConfVars enum and use underscore for better readability

2024-01-03 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27925.
---
Resolution: Fixed

Merged to master.
Thanks [~kokila19] for the patch and [~abstractdog] for the review.

> HiveConf: unify ConfVars enum and use underscore for better readability 
> 
>
> Key: HIVE-27925
> URL: https://issues.apache.org/jira/browse/HIVE-27925
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Kokila N
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When I read something like 
> "[BASICSTATSTASKSMAXTHREADSFACTOR|https://github.com/apache/hive/blob/70f34e27349dccf5fabbfc6c63e63c7be0785360/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L753];
>  I feel someone in the world laughs out loud thinking of me struggling. I can 
> read it, but I hate it :) imagine what if we have vars like 
> [HIVE_MATERIALIZED_VIEW_ENABLE_AUTO_REWRITING_SUBQUERY_SQL|https://github.com/apache/hive/blob/70f34e27349dccf5fabbfc6c63e63c7be0785360/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1921]
>  without underscores...okay, let me help, it is: 
> HIVEMATERIALIZEDVIEWENABLEAUTOREWRITINGSUBQUERYSQL :D
> please let's fix this in 4.0.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27749) SchemaTool initSchema fails on Mariadb 10.2

2023-12-18 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798190#comment-17798190
 ] 

Sourabh Badhya commented on HIVE-27749:
---

Thanks [~dkuzmenko] and [~InvisibleProgrammer] for the reviews.

> SchemaTool initSchema fails on Mariadb 10.2
> ---
>
> Key: HIVE-27749
> URL: https://issues.apache.org/jira/browse/HIVE-27749
> Project: Hive
>  Issue Type: Bug
>  Components: Standalone Metastore
>Affects Versions: 4.0.0-beta-1
>Reporter: Stamatis Zampetakis
>Assignee: Sourabh Badhya
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: mariadb-metastore-schema-tests.patch
>
>
> Schema initialization for 4.0.0-beta-1 fails when run on Mariadb 10.2.
> The problem is reproducible on current 
> (e5a7ce2f091da1f8a324da6e489cda59b9e4bfc6) master by applying the 
> [^mariadb-metastore-schema-tests.patch] and then running:
> {noformat}
> mvn test -Dtest=TestMariadb#install -Dtest.groups=""{noformat}
> The error is shown below:
> {noformat}
> 315/409  ALTER TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` 
> BIGINT(20) GENERATED ALWAYS AS (1) STORED NOT NULL;
> Error: (conn=11) You have an error in your SQL syntax; check the manual that 
> corresponds to your MariaDB server version for the right syntax to use near 
> 'NOT NULL' at line 1 (state=42000,code=1064)
> Aborting command set because "force" is false and command failed: "ALTER 
> TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` BIGINT(20) GENERATED 
> ALWAYS AS (1) STORED NOT NULL;"
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Schema 
> initialization FAILED! Metastore state would be inconsistent!
> Schema initialization FAILED! Metastore state would be inconsistent!
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Underlying 
> cause: java.io.IOException : Schema script failed, errorcode OTHER
> Underlying cause: java.io.IOException : Schema script failed, errorcode OTHER
> org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization 
> FAILED! Metastore state would be inconsistent!
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.SchemaToolTaskInit.execute(SchemaToolTaskInit.java:66)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:480)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:425)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.rules.DatabaseRule.installLatest(DatabaseRule.java:269)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.DbInstallBase.install(DbInstallBase.java:34)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> at 
> 

[jira] [Updated] (HIVE-27824) Upgrade ivy to 2.5.2 and htmlunit to 2.70.0

2023-12-17 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27824:
--
Summary: Upgrade ivy to 2.5.2 and htmlunit to 2.70.0  (was: Upgrade Ivy to 
2.5.2)

> Upgrade ivy to 2.5.2 and htmlunit to 2.70.0
> ---
>
> Key: HIVE-27824
> URL: https://issues.apache.org/jira/browse/HIVE-27824
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Devaspati Krishnatri
>Assignee: Devaspati Krishnatri
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: mvn_dependency_tree.txt
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27824) Upgrade Ivy to 2.5.2

2023-12-17 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27824.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master.
Thanks [~devaspatikrishnatri] for the patch and [~aturoczy] for the review.

> Upgrade Ivy to 2.5.2
> 
>
> Key: HIVE-27824
> URL: https://issues.apache.org/jira/browse/HIVE-27824
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Devaspati Krishnatri
>Assignee: Devaspati Krishnatri
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: mvn_dependency_tree.txt
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HIVE-27749) SchemaTool initSchema fails on Mariadb 10.2

2023-12-14 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796694#comment-17796694
 ] 

Sourabh Badhya edited comment on HIVE-27749 at 12/14/23 11:46 AM:
--

[~ngangam] , as I mentioned this alternate query is supported in MariaDB 
version 10.2. (I validated this query on a MariaDB 10.2 SQL environment).
{code:java}
ALTER TABLE `NOTIFICATION_SEQUENCE` ADD CONSTRAINT `ONE_ROW_CONSTRAINT` CHECK 
(`NNI_ID` = 1); {code}
The only problem with this query is that on MySQL 5.7 (which is currently EOL) 
the constraint is parsed but not enforced. Enforcement of the constraint was 
introduced in MySQL 8.0.16 (the next major MySQL version). This query is a 
better alternative to the current query, which doesn't run on MariaDB. 

Prior to this patch, there was no constraint whatsoever on this column. 


was (Author: JIRAUSER287127):
[~ngangam] , as I mentioned this alternate query is supported in MariaDB 
version 10.2. (I validated this query on a MariaDB 10.2 SQL environment).
{code:java}
ALTER TABLE `NOTIFICATION_SEQUENCE` ADD CONSTRAINT `ONE_ROW_CONSTRAINT` CHECK 
(`NNI_ID` = 1); {code}
The only problem with this query is that on MySQL 5.7 (which is EOL currently) 
this query is parsed but not enforced. The enforcing of the constraint happens 
post MySQL 8.0.16 (Next major MySQL version). This query is a better 
alternative that the current query which doesn't run on MariaDB.

> SchemaTool initSchema fails on Mariadb 10.2
> ---
>
> Key: HIVE-27749
> URL: https://issues.apache.org/jira/browse/HIVE-27749
> Project: Hive
>  Issue Type: Bug
>  Components: Standalone Metastore
>Affects Versions: 4.0.0-beta-1
>Reporter: Stamatis Zampetakis
>Assignee: Naveen Gangam
>Priority: Major
> Attachments: mariadb-metastore-schema-tests.patch
>
>
> Schema initialization for 4.0.0-beta-1 fails when run on Mariadb 10.2.
> The problem is reproducible on current 
> (e5a7ce2f091da1f8a324da6e489cda59b9e4bfc6) master by applying the 
> [^mariadb-metastore-schema-tests.patch] and then running:
> {noformat}
> mvn test -Dtest=TestMariadb#install -Dtest.groups=""{noformat}
> The error is shown below:
> {noformat}
> 315/409  ALTER TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` 
> BIGINT(20) GENERATED ALWAYS AS (1) STORED NOT NULL;
> Error: (conn=11) You have an error in your SQL syntax; check the manual that 
> corresponds to your MariaDB server version for the right syntax to use near 
> 'NOT NULL' at line 1 (state=42000,code=1064)
> Aborting command set because "force" is false and command failed: "ALTER 
> TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` BIGINT(20) GENERATED 
> ALWAYS AS (1) STORED NOT NULL;"
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Schema 
> initialization FAILED! Metastore state would be inconsistent!
> Schema initialization FAILED! Metastore state would be inconsistent!
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Underlying 
> cause: java.io.IOException : Schema script failed, errorcode OTHER
> Underlying cause: java.io.IOException : Schema script failed, errorcode OTHER
> org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization 
> FAILED! Metastore state would be inconsistent!
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.SchemaToolTaskInit.execute(SchemaToolTaskInit.java:66)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:480)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:425)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.rules.DatabaseRule.installLatest(DatabaseRule.java:269)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.DbInstallBase.install(DbInstallBase.java:34)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at 
> 

[jira] [Commented] (HIVE-27749) SchemaTool initSchema fails on Mariadb 10.2

2023-12-14 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796694#comment-17796694
 ] 

Sourabh Badhya commented on HIVE-27749:
---

[~ngangam] , as I mentioned this alternate query is supported in MariaDB 
version 10.2. (I validated this query on a MariaDB 10.2 SQL environment).
{code:java}
ALTER TABLE `NOTIFICATION_SEQUENCE` ADD CONSTRAINT `ONE_ROW_CONSTRAINT` CHECK 
(`NNI_ID` = 1); {code}
The only problem with this query is that on MySQL 5.7 (which is currently EOL) 
the constraint is parsed but not enforced. Enforcement of the constraint was 
introduced in MySQL 8.0.16 (the next major MySQL version). This query is a 
better alternative to the current query, which doesn't run on MariaDB.

> SchemaTool initSchema fails on Mariadb 10.2
> ---
>
> Key: HIVE-27749
> URL: https://issues.apache.org/jira/browse/HIVE-27749
> Project: Hive
>  Issue Type: Bug
>  Components: Standalone Metastore
>Affects Versions: 4.0.0-beta-1
>Reporter: Stamatis Zampetakis
>Assignee: Naveen Gangam
>Priority: Major
> Attachments: mariadb-metastore-schema-tests.patch
>
>
> Schema initialization for 4.0.0-beta-1 fails when run on Mariadb 10.2.
> The problem is reproducible on current 
> (e5a7ce2f091da1f8a324da6e489cda59b9e4bfc6) master by applying the 
> [^mariadb-metastore-schema-tests.patch] and then running:
> {noformat}
> mvn test -Dtest=TestMariadb#install -Dtest.groups=""{noformat}
> The error is shown below:
> {noformat}
> 315/409  ALTER TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` 
> BIGINT(20) GENERATED ALWAYS AS (1) STORED NOT NULL;
> Error: (conn=11) You have an error in your SQL syntax; check the manual that 
> corresponds to your MariaDB server version for the right syntax to use near 
> 'NOT NULL' at line 1 (state=42000,code=1064)
> Aborting command set because "force" is false and command failed: "ALTER 
> TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` BIGINT(20) GENERATED 
> ALWAYS AS (1) STORED NOT NULL;"
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Schema 
> initialization FAILED! Metastore state would be inconsistent!
> Schema initialization FAILED! Metastore state would be inconsistent!
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Underlying 
> cause: java.io.IOException : Schema script failed, errorcode OTHER
> Underlying cause: java.io.IOException : Schema script failed, errorcode OTHER
> org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization 
> FAILED! Metastore state would be inconsistent!
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.SchemaToolTaskInit.execute(SchemaToolTaskInit.java:66)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:480)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:425)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.rules.DatabaseRule.installLatest(DatabaseRule.java:269)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.DbInstallBase.install(DbInstallBase.java:34)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> at 

[jira] [Commented] (HIVE-27749) SchemaTool initSchema fails on Mariadb 10.2

2023-12-06 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794033#comment-17794033
 ] 

Sourabh Badhya commented on HIVE-27749:
---

[~ngangam] I was able to reproduce the issue as well. It seems the claim that 
MariaDB is compatible with MySQL syntax is incorrect. 
A possibly related MariaDB bug report, similar to the issue being faced - 
[https://jira.mariadb.org/browse/MDEV-10964]
At the time of writing this patch, I had only two options: either add a CHECK 
constraint on this column or use generated columns for MySQL.
The alternate query (CHECK constraint query) is - 
{code:java}
ALTER TABLE `NOTIFICATION_SEQUENCE` ADD CONSTRAINT `ONE_ROW_CONSTRAINT` CHECK 
(`NNI_ID` = 1);{code}
AFAIK MariaDB supports CHECK constraints in 10.2 (validated this as well), but 
MySQL 5.7 parses CHECK constraints without enforcing them. Enforcement of the 
constraint is present only from 8.0.16 (the next major release).
Source - [https://dev.mysql.com/doc/refman/5.7/en/create-table.html]

It's worth noting that some of the database versions being discussed here are 
EOL versions, like MariaDB 10.2 and MySQL 5.7.
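
As a minimal sketch of the difference (the second column name is taken from the 
metastore schema and the inserted values are only illustrative), an engine that 
enforces CHECK constraints rejects the second statement, while MySQL 5.7 would 
accept both - 
{code:java}
-- Allowed: the single sequence row with NNI_ID = 1
INSERT INTO `NOTIFICATION_SEQUENCE` (`NNI_ID`, `NEXT_EVENT_ID`) VALUES (1, 1);

-- Rejected on MariaDB 10.2+ / MySQL 8.0.16+ with a CHECK constraint violation;
-- silently accepted on MySQL 5.7, where the constraint is parsed but not enforced
INSERT INTO `NOTIFICATION_SEQUENCE` (`NNI_ID`, `NEXT_EVENT_ID`) VALUES (2, 2);
{code}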

> SchemaTool initSchema fails on Mariadb 10.2
> ---
>
> Key: HIVE-27749
> URL: https://issues.apache.org/jira/browse/HIVE-27749
> Project: Hive
>  Issue Type: Bug
>  Components: Standalone Metastore
>Affects Versions: 4.0.0-beta-1
>Reporter: Stamatis Zampetakis
>Assignee: Naveen Gangam
>Priority: Major
> Attachments: mariadb-metastore-schema-tests.patch
>
>
> Schema initialization for 4.0.0-beta-1 fails when run on Mariadb 10.2.
> The problem is reproducible on current 
> (e5a7ce2f091da1f8a324da6e489cda59b9e4bfc6) master by applying the 
> [^mariadb-metastore-schema-tests.patch] and then running:
> {noformat}
> mvn test -Dtest=TestMariadb#install -Dtest.groups=""{noformat}
> The error is shown below:
> {noformat}
> 315/409  ALTER TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` 
> BIGINT(20) GENERATED ALWAYS AS (1) STORED NOT NULL;
> Error: (conn=11) You have an error in your SQL syntax; check the manual that 
> corresponds to your MariaDB server version for the right syntax to use near 
> 'NOT NULL' at line 1 (state=42000,code=1064)
> Aborting command set because "force" is false and command failed: "ALTER 
> TABLE `NOTIFICATION_SEQUENCE` MODIFY COLUMN `NNI_ID` BIGINT(20) GENERATED 
> ALWAYS AS (1) STORED NOT NULL;"
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Schema 
> initialization FAILED! Metastore state would be inconsistent!
> Schema initialization FAILED! Metastore state would be inconsistent!
> [ERROR] 2023-09-27 21:36:30.317 [main] MetastoreSchemaTool - Underlying 
> cause: java.io.IOException : Schema script failed, errorcode OTHER
> Underlying cause: java.io.IOException : Schema script failed, errorcode OTHER
> org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization 
> FAILED! Metastore state would be inconsistent!
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.SchemaToolTaskInit.execute(SchemaToolTaskInit.java:66)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:480)
> at 
> org.apache.hadoop.hive.metastore.tools.schematool.MetastoreSchemaTool.run(MetastoreSchemaTool.java:425)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.rules.DatabaseRule.installLatest(DatabaseRule.java:269)
> at 
> org.apache.hadoop.hive.metastore.dbinstall.DbInstallBase.install(DbInstallBase.java:34)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)

[jira] [Resolved] (HIVE-27918) Iceberg: Push transforms for clustering during table writes

2023-12-05 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27918.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~dkuzmenko] for the reviews.

> Iceberg: Push transforms for clustering during table writes
> ---
>
> Key: HIVE-27918
> URL: https://issues.apache.org/jira/browse/HIVE-27918
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Currently, transformed columns (except for the bucket transform) are not 
> pushed / passed as clustering columns. This can lead to incorrect clustering 
> on such columns, which in turn can lead to non-performant writes.
> Hence, push transforms for clustering during table writes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27918) Iceberg: Push transforms for clustering during table writes

2023-11-29 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27918:
-

 Summary: Iceberg: Push transforms for clustering during table 
writes
 Key: HIVE-27918
 URL: https://issues.apache.org/jira/browse/HIVE-27918
 Project: Hive
  Issue Type: Improvement
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Currently, transformed columns (except for the bucket transform) are not 
pushed / passed as clustering columns. This can lead to incorrect clustering on 
such columns, which in turn can lead to non-performant writes.

Hence, push transforms for clustering during table writes.
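
A minimal sketch of the kind of write this targets, assuming Hive's Iceberg DDL 
with a partition transform (table and column names are only illustrative) - 
{code:java}
-- Table partitioned by a month transform rather than an identity column
CREATE TABLE ice_sales (id int, amount double, ts timestamp)
PARTITIONED BY SPEC (month(ts))
STORED BY ICEBERG;

-- With the transform pushed as a clustering column, rows are clustered by
-- month(ts) before the write, so each partition's files are written contiguously
INSERT INTO ice_sales SELECT id, amount, ts FROM staging_sales;
{code}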



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27779) Iceberg: Drop partition support

2023-11-04 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27779.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master.
Thanks [~dkuzmenko] for the review.

> Iceberg: Drop partition support
> ---
>
> Key: HIVE-27779
> URL: https://issues.apache.org/jira/browse/HIVE-27779
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> A logical extension of TRUNCATE PARTITION; however, DROP PARTITION also 
> allows expressions with >, <, >=, <=, !=, etc.
> The syntax is as follows - 
> {code:java}
> alter table tableName drop partition (c='US', d<'2');{code}
> The DROP PARTITION command also allows multiple partition expressions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27731) Perform metadata delete when only static filters are present

2023-10-19 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27731.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master.
Thanks [~dkuzmenko] and [~kkasa] for the reviews.

> Perform metadata delete when only static filters are present
> 
>
> Key: HIVE-27731
> URL: https://issues.apache.org/jira/browse/HIVE-27731
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When the query has static filters only, try to perform a metadata delete 
> directly rather than moving forward with positional delete.
> Some relevant use cases where metadata deletes can be used - 
> {code:java}
> DELETE FROM ice_table where id = 1;{code}
> As seen above, the only filter is (id = 1). In scenarios where the filter 
> corresponds to a partition column, a metadata delete is more efficient and 
> does not generate additional files.
> For partition evolution cases, if it is not possible to perform a metadata 
> delete then a positional delete is done.
> Another optimisation that can be seen here is utilizing vectorized 
> expressions for UDFs which provide them, such as year - 
> {code:java}
> DELETE FROM ice_table where id = 1 AND year(datecol) = 2015;{code}
> Delete queries with multi-table scans will not be optimized using this method 
> since the determination of the where clauses happens at runtime.
> A similar optimisation is seen in Spark, where a metadata delete is done 
> whenever possible - 
> [https://github.com/apache/iceberg/blob/master/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L297-L389]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27779) Iceberg: Drop partition support

2023-10-09 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27779:
-

 Summary: Iceberg: Drop partition support
 Key: HIVE-27779
 URL: https://issues.apache.org/jira/browse/HIVE-27779
 Project: Hive
  Issue Type: Improvement
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


A logical extension of TRUNCATE PARTITION; however, DROP PARTITION also allows 
expressions with >, <, >=, <=, !=, etc.
The syntax is as follows - 
{code:java}
alter table tableName drop partition (c='US', d<'2');{code}
The DROP PARTITION command also allows multiple partition expressions.
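
For example, a range-based drop might look like the following (sketch; the 
column names reuse the ones above, and multiple PARTITION specs are combined) - 
{code:java}
-- Drops every partition with d below '2', and separately every partition with c = 'EU'
alter table tableName drop partition (d < '2'), partition (c = 'EU');
{code}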



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-26455) Remove PowerMockito from hive-exec

2023-10-05 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-26455.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master.
Thanks [~InvisibleProgrammer] for the contribution and [~ayushtkn] , [~rkirtir] 
, [~zratkai] for the reviews.

> Remove PowerMockito from hive-exec
> --
>
> Key: HIVE-26455
> URL: https://issues.apache.org/jira/browse/HIVE-26455
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Zsolt Miskolczi
>Assignee: Zsolt Miskolczi
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> PowerMockito is a Mockito extension that introduces some pain points. 
> Its main purpose is to enable static mocking. Since PowerMockito's release, 
> mockito-inline has become available as a part of mockito-core. 
> It doesn't require the vintage test runner to run, and it can mock 
> objects within their own thread. 
> The goal is to stop using PowerMockito and use mockito-inline instead.
>  
> The affected packages are: 
>  * org.apache.hadoop.hive.ql.exec.repl
>  * org.apache.hadoop.hive.ql.exec.repl.bootstrap.load
>  * org.apache.hadoop.hive.ql.exec.repl.ranger;
>  * org.apache.hadoop.hive.ql.exec.util
>  * org.apache.hadoop.hive.ql.parse.repl
>  * org.apache.hadoop.hive.ql.parse.repl.load.message
>  * org.apache.hadoop.hive.ql.parse.repl.metric
>  * org.apache.hadoop.hive.ql.txn.compactor
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27672) Iceberg: Truncate partition support

2023-09-27 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27672.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Merged to master. Thanks [~kkasa] and [~dkuzmenko] for the reviews.

> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> Truncate is not supported for partition transforms.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27731) Perform metadata delete when only static filters are present

2023-09-26 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27731:
-

 Summary: Perform metadata delete when only static filters are 
present
 Key: HIVE-27731
 URL: https://issues.apache.org/jira/browse/HIVE-27731
 Project: Hive
  Issue Type: Improvement
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


When the query has static filters only, try to perform a metadata delete 
directly rather than moving forward with positional delete.

Some relevant use cases where metadata deletes can be used - 
{code:java}
DELETE FROM ice_table where id = 1;{code}
As seen above, the only filter is (id = 1). In scenarios where the filter 
corresponds to a partition column, a metadata delete is more efficient and does 
not generate additional files.
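
For illustration, a delete whose predicate aligns with the table's partitioning 
can be served from metadata alone (sketch; identity partitioning on 'region' is 
hypothetical) - 
{code:java}
-- With ice_table partitioned by identity(region), this delete can drop the
-- matching data files via metadata, without writing positional delete files
DELETE FROM ice_table WHERE region = 'US';
{code}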

For partition evolution cases, if it is not possible to perform a metadata 
delete then a positional delete is done.

Another optimisation that can be seen here is utilizing vectorized expressions 
for UDFs which provide them, such as year - 
{code:java}
DELETE FROM ice_table where id = 1 AND year(datecol) = 2015;{code}
Delete queries with multi-table scans will not be optimized using this method 
since the determination of the where clauses happens at runtime.

A similar optimisation is seen in Spark, where a metadata delete is done 
whenever possible - 
[https://github.com/apache/iceberg/blob/master/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java#L297-L389]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-12 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
Truncate is not supported for partition transforms.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in  format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in -MM 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in -MM-DD 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in -MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in the following format is based on 
the directory structure of the data in Iceberg tables. The input reflects the 
same value that are ideally seen the data directories in Iceberg tables.

For table which has undergone partition evolution, truncate is possible for 
only identity transform and is only possible for newly added partition which 
are outside the lower bound and upper bound of the partition column of the 
existing files (files prior to partition evolution). If the newly added 
partition is within the lower bound and upper bound of the partition column of 
the existing files then performing truncate operation on the newly added 
partition throws a ValidationException.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> Truncate is not supported for partition transforms.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must be 
referenced with a suffix appended to the column name, as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in YYYY format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in YYYY-MM 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in YYYY-MM-DD 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in YYYY-MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in this format is the directory 
structure of the data in Iceberg tables: the input reflects the same values 
that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only 
for the identity transform and only for newly added partitions which are 
outside the lower bound and upper bound of the partition column of the 
existing files (files prior to partition evolution). If the newly added 
partition is within the lower bound and upper bound of the partition column of 
the existing files, then performing a truncate operation on the newly added 
partition throws a ValidationException.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in  format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in -MM 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in -MM-DD 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in -MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in the following format is based on 
the directory structure of the data in Iceberg tables. The input reflects the 
same value that are ideally seen the data directories in Iceberg tables.

For table which has undergone partition evolution, truncate is possible for 
only identity transform and is only possible for newly added partition which 
are outside the lower bound and upper bound of the partition column of the 
existing files. If the newly added partition is within the lower bound and 
upper bound of the partition column of the existing files then performing 
truncate operation on the newly added partition throws a ValidationException.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName 

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must be 
referenced with a suffix appended to the column name, as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in YYYY format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in YYYY-MM 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in YYYY-MM-DD 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in YYYY-MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in this format is the directory 
structure of the data in Iceberg tables: the input reflects the same values 
that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only 
for the identity transform and only for newly added partitions which are 
outside the lower bound and upper bound of the partition column of the 
existing files. If the newly added partition is within the lower bound and 
upper bound of the partition column of the existing files, then performing a 
truncate operation on the newly added partition throws a ValidationException.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in  format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in -MM 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in -MM-DD 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in -MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in the following format is based on 
the directory structure of the data in Iceberg tables. The input reflects the 
same value that are ideally seen the data directories in Iceberg tables.

For table which has undergone partition evolution, truncate is possible for 
only identity transform and is only possible for newly added partition which 
are outside the lower bound and upper bound of the partition column of the 
existing files. If the newly added partition is within the lower bound and 
upper bound of the existing files then performing truncate operation on the 
newly added partition throws a ValidationException.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column 

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must be 
referenced with a suffix appended to the column name, as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in YYYY format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in YYYY-MM 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in YYYY-MM-DD 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in YYYY-MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in this format is the directory 
structure of the data in Iceberg tables: the input reflects the same values 
that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only 
for the identity transform and only for newly added partitions which are 
outside the lower bound and upper bound of the partition column of the 
existing files. If the newly added partition is within the lower bound and 
upper bound of the existing files, then performing a truncate operation on the 
newly added partition throws a ValidationException.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in  format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in -MM 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in -MM-DD 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in -MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in the following format is based on 
the directory structure of the data in Iceberg tables. The input reflects the 
same value that are ideally seen the data directories in Iceberg tables.

For table which has undergone partition evolution, truncate is possible for 
only identity transform and is only possible for newly added partition which 
are outside the lower bound and upper bound of the partition column of the 
existing files. If the newly added partition is within the lower bound and 
upper bound of the existing files then a ValidationException is thrown.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column - b_bucket
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_bucket = 

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must be 
referenced with a suffix appended to the column name, as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in YYYY format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in YYYY-MM 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in YYYY-MM-DD 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in YYYY-MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in this format is the directory 
structure of the data in Iceberg tables: the input reflects the same values 
that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only 
for the identity transform and only for newly added partitions which are 
outside the lower bound and upper bound of the partition column of the 
existing files. If the newly added partition is within the upper bound and 
lower bound of the existing files, then a ValidationException is thrown.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in  format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in -MM 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in -MM-DD 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in -MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in the following format is based on 
the directory structure of the data in Iceberg tables. The input reflects the 
same value that are ideally seen the data directories in Iceberg tables.

For table which has undergone partition evolution, truncate is possible for 
only identity transform and is only possible for newly added partition which 
are outside the upper bound and lower bound of the existing files. If the newly 
added partition is within the upper bound and lower bound of the existing files 
then a ValidationException is thrown.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column - b_bucket
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
> 3. Year transform on 'b' column - b_year - The value should be in  

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must be 
referenced with a suffix appended to the column name, as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in YYYY format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in YYYY-MM 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in YYYY-MM-DD 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in YYYY-MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in this format is the directory 
structure of the data in Iceberg tables: the input reflects the same values 
that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only 
for the identity transform and only for newly added partitions which are 
outside the lower bound and upper bound of the partition column of the 
existing files. If the newly added partition is within the lower bound and 
upper bound of the existing files, then a ValidationException is thrown.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in  format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in -MM 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in -MM-DD 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in -MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in the following format is based on 
the directory structure of the data in Iceberg tables. The input reflects the 
same value that are ideally seen the data directories in Iceberg tables.

For table which has undergone partition evolution, truncate is possible for 
only identity transform and is only possible for newly added partition which 
are outside the lower_bound and upper_bound of the partition column of the 
existing files. If the newly added partition is within the upper bound and 
lower bound of the existing files then a ValidationException is thrown.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column - b_bucket
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
> 3. Year transform on 'b' column - b_year - The 

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must be 
referenced with a suffix appended to the column name, as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in YYYY format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in YYYY-MM 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in YYYY-MM-DD 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in YYYY-MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13', c_trunc = 
'xy');{code}
The motivation for specifying the inputs in this format is the directory 
structure of the data in Iceberg tables: the input reflects the same values 
that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only 
for the identity transform and only for newly added partitions which are 
outside the upper bound and lower bound of the existing files. If the newly 
added partition is within the upper bound and lower bound of the existing 
files, then a ValidationException is thrown.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in  format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in -MM 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in -MM-DD 
format
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in -MM-DD-HH 
format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13', c_trunc = 
'xy');{code}
For table which has undergone partition evolution, truncate is possible for 
only identity transform and is only possible for newly added partition which 
are outside the upper bound and lower bound of the existing files. If the newly 
added partition is within the upper bound and lower bound of the existing files 
then a ValidationException is thrown.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column - b_bucket
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
> 3. Year transform on 'b' column - b_year - The value should be in  format.
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
> 4. Month transform on 'b' column - b_month - The value should be in -MM 
> format
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in yyyy format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in yyyy-MM format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in yyyy-MM-DD format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in yyyy-MM-DD-HH format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07', c_trunc = 'xy');{code}
The motivation for specifying the inputs in this format is based on the directory structure of the data in Iceberg tables. The input reflects the same values that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only for the identity transform and only for newly added partitions which are outside the upper and lower bounds of the existing files. If the newly added partition is within the upper and lower bounds of the existing files, then a ValidationException is thrown.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in yyyy format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in yyyy-MM format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in yyyy-MM-DD format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in yyyy-MM-DD-HH format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13', c_trunc = 
'xy');{code}
The motivation for specifying the inputs in this format is based on the directory structure of the data in Iceberg tables. The input reflects the same values that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only for the identity transform and only for newly added partitions which are outside the upper and lower bounds of the existing files. If the newly added partition is within the upper and lower bounds of the existing files, then a ValidationException is thrown.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column - b_bucket
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
> 3. Year transform on 'b' column - b_year - The value should be in yyyy format.
> {code:java}

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in yyyy format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in yyyy-MM format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in yyyy-MM-DD format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in yyyy-MM-DD-HH format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_hour = '2022-08-07-13', c_trunc = 
'xy');{code}
The motivation for specifying the inputs in this format is based on the directory structure of the data in Iceberg tables. The input reflects the same values that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only for the identity transform and only for newly added partitions which are outside the upper and lower bounds of the existing files. If the newly added partition is within the upper and lower bounds of the existing files, then a ValidationException is thrown.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in yyyy format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in yyyy-MM format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in yyyy-MM-DD format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in yyyy-MM-DD-HH format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13', c_trunc = 
'xy');{code}
The motivation for specifying the inputs in this format is based on the directory structure of the data in Iceberg tables. The input reflects the same values that are ideally seen in the data directories of Iceberg tables.

For a table which has undergone partition evolution, truncate is possible only for the identity transform and only for newly added partitions which are outside the upper and lower bounds of the existing files. If the newly added partition is within the upper and lower bounds of the existing files, then a ValidationException is thrown.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column - b_bucket
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
> 3. Year transform on 'b' column - b_year - The value should be in yyyy format.
> 

[jira] [Updated] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27672:
--
Description: 
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in yyyy format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in yyyy-MM format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Day transform on 'b' column - b_day - The value should be in yyyy-MM-DD format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in yyyy-MM-DD-HH format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13', c_trunc = 
'xy');{code}
For a table which has undergone partition evolution, truncate is possible only for the identity transform and only for newly added partitions which are outside the upper and lower bounds of the existing files. If the newly added partition is within the upper and lower bounds of the existing files, then a ValidationException is thrown.

  was:
Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in yyyy format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in yyyy-MM format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Month transform on 'b' column - b_day - The value should be in yyyy-MM-DD format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in yyyy-MM-DD-HH format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13', c_trunc = 
'xy');{code}
For a table which has undergone partition evolution, truncate is possible only for the identity transform and only for newly added partitions which are outside the upper and lower bounds of the existing files. If the newly added partition is within the upper and lower bounds of the existing files, then a ValidationException is thrown.


> Iceberg: Truncate partition support
> ---
>
> Key: HIVE-27672
> URL: https://issues.apache.org/jira/browse/HIVE-27672
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Support the following truncate operations on a partition level - 
> {code:java}
> TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
> partValue2);{code}
> For partition transforms other than identity, the partition column must have 
> a suffix to the column as follows - 
> 1. Truncate transform on 'b' column - b_trunc
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
> 2. Bucket transform on 'b' column - b_bucket
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
> 3. Year transform on 'b' column - b_year - The value should be in yyyy format.
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
> 4. Month transform on 'b' column - b_month - The value should be in yyyy-MM 
> format
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
> 5. Day transform on 'b' column - b_day - The value should be in yyyy-MM-DD 
> format
> {code:java}
> TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
> 6. Hour transform on 'b' column - b_hour - The value 

[jira] [Created] (HIVE-27672) Iceberg: Truncate partition support

2023-09-06 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27672:
-

 Summary: Iceberg: Truncate partition support
 Key: HIVE-27672
 URL: https://issues.apache.org/jira/browse/HIVE-27672
 Project: Hive
  Issue Type: New Feature
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Support the following truncate operations on a partition level - 
{code:java}
TRUNCATE TABLE tableName PARTITION (partCol1 = partValue1, partCol2 = 
partValue2);{code}
For partition transforms other than identity, the partition column must have a 
suffix to the column as follows - 
1. Truncate transform on 'b' column - b_trunc
{code:java}
TRUNCATE TABLE tableName PARTITION (b_trunc = 'xy');{code}
2. Bucket transform on 'b' column - b_bucket
{code:java}
TRUNCATE TABLE tableName PARTITION (b_bucket = 10);{code}
3. Year transform on 'b' column - b_year - The value should be in yyyy format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_year = '2022');{code}
4. Month transform on 'b' column - b_month - The value should be in yyyy-MM format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_month = '2022-08'); {code}
5. Month transform on 'b' column - b_day - The value should be in yyyy-MM-DD format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07');{code}
6. Hour transform on 'b' column - b_hour - The value should be in yyyy-MM-DD-HH format.
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13'); {code}
Specifying multiple conditions is also supported - 
{code:java}
TRUNCATE TABLE tableName PARTITION (b_day = '2022-08-07-13', c_trunc = 
'xy');{code}
For a table which has undergone partition evolution, truncate is possible only for the identity transform and only for newly added partitions which are outside the upper and lower bounds of the existing files. If the newly added partition is within the upper and lower bounds of the existing files, then a ValidationException is thrown.
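To illustrate the partition-evolution restriction above, a hedged sketch (the table/column names and the ALTER TABLE ... SET PARTITION SPEC syntax are assumptions for illustration, not taken from this change):
{code:java}
-- Hypothetical table that starts unpartitioned and later evolves to an identity partition.
CREATE EXTERNAL TABLE evo_example (id INT, region STRING) STORED BY ICEBERG;
INSERT INTO evo_example VALUES (1, 'EU'), (2, 'US');

-- Partition evolution: add an identity partition on region.
ALTER TABLE evo_example SET PARTITION SPEC (region);
INSERT INTO evo_example VALUES (3, 'APAC');

-- Truncating a partition value outside the min/max bounds of the pre-evolution files
-- (e.g. 'APAC' here) is expected to work; a value within those bounds (e.g. 'EU')
-- is expected to fail with a ValidationException.
TRUNCATE TABLE evo_example PARTITION (region = 'APAC');
{code}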



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27627) Iceberg: Insert into/overwrite partition support

2023-09-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27627:
--
Fix Version/s: 4.0.0

> Iceberg: Insert into/overwrite partition support
> 
>
> Key: HIVE-27627
> URL: https://issues.apache.org/jira/browse/HIVE-27627
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Support inserting data in the following query types -
> Inserting data via static partition -
> {code:java}
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol = pColValue) VALUES 
> (...);
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol = pColValue) SELECT 
> query;{code}
> Inserting data via dynamic partitioning - 
> {code:java}
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol) VALUES (...); 
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol) SELECT query; {code}
> Inserting data via static and dynamic partitioning with static partitioning 
> coming at the beginning - 
> {code:java}
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol1 = pColValue, pCol2) 
> VALUES (...); 
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol1 = pColValue, pCol2) 
> SELECT query;{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27627) Iceberg: Insert into/overwrite partition support

2023-09-05 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762015#comment-17762015
 ] 

Sourabh Badhya commented on HIVE-27627:
---

Thanks [~kkasa] for the review.

> Iceberg: Insert into/overwrite partition support
> 
>
> Key: HIVE-27627
> URL: https://issues.apache.org/jira/browse/HIVE-27627
> Project: Hive
>  Issue Type: New Feature
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> Support inserting data in the following query types -
> Inserting data via static partition -
> {code:java}
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol = pColValue) VALUES 
> (...);
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol = pColValue) SELECT 
> query;{code}
> Inserting data via dynamic partitioning - 
> {code:java}
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol) VALUES (...); 
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol) SELECT query; {code}
> Inserting data via static and dynamic partitioning with static partitioning 
> coming at the beginning - 
> {code:java}
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol1 = pColValue, pCol2) 
> VALUES (...); 
> INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol1 = pColValue, pCol2) 
> SELECT query;{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27536) Merge task must be invoked after optimisation for external CTAS queries

2023-08-27 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27536.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Thanks [~dkuzmenko], [~dengzh], [~aturoczy] for the reviews.

> Merge task must be invoked after optimisation for external CTAS queries
> ---
>
> Key: HIVE-27536
> URL: https://issues.apache.org/jira/browse/HIVE-27536
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Merge task is not invoked on S3 file system / object stores when CTAS query 
> is performed. 
> Repro test - Test.q
> {code:java}
> --! qt:dataset:src
> set hive.mapred.mode=nonstrict;
> set hive.explain.user=false;
> set hive.merge.mapredfiles=true;
> set hive.merge.mapfiles=true;
> set hive.merge.tezfiles=true;
> set hive.blobstore.supported.schemes=hdfs,file;
> set hive.merge.smallfiles.avgsize=7500;
> -- SORT_QUERY_RESULTS
> create table part_source(key string, value string) partitioned by (ds string);
> create table source(key string);
> -- The partitioned table must have 2 files per partition (necessary for merge 
> task)
> insert overwrite table part_source partition(ds='102') select * from src;
> insert into table part_source partition(ds='102') select * from src;
> insert overwrite table part_source partition(ds='103') select * from src;
> insert into table part_source partition(ds='102') select * from src;
> -- The unpartitioned table must have 2 files.
> insert overwrite table source select key from src;
> insert into table source select key from src;
> -- Create CTAS tables both for unpartitioned and partitioned cases for ORC 
> formats.
> explain analyze create external table ctas_table stored as orc as select * 
> from source;
> create external table ctas_table stored as orc as select * from source;
> explain analyze create external table ctas_part_table partitioned by (ds) 
> stored as orc as select * from part_source;
> create external table ctas_part_table partitioned by (ds) stored as orc as 
> select * from part_source;
> -- This must be 1 indicating there is 1 file after merge.
> select count(distinct(INPUT__FILE__NAME)) from ctas_table;
> -- This must be 2 indicating there is 1 file per partition after merge.
> select count(distinct(INPUT__FILE__NAME)) from ctas_part_table;
> -- Create CTAS tables both for unpartitioned and partitioned cases for 
> non-ORC formats.
> explain analyze create external table ctas_table_non_orc as select * from 
> source;
> create external table ctas_table_non_orc as select * from source;
> explain analyze create external table ctas_part_table_non_orc partitioned by 
> (ds) as select * from part_source;
> create external table ctas_part_table_non_orc partitioned by (ds) as select * 
> from part_source;
> -- This must be 1 indicating there is 1 file after merge.
> select count(distinct(INPUT__FILE__NAME)) from ctas_table_non_orc;
> -- This must be 2 indicating there is 1 file per partition after merge.
> select count(distinct(INPUT__FILE__NAME)) from ctas_part_table_non_orc;
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27627) Iceberg: Insert into/overwrite partition support

2023-08-16 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27627:
-

 Summary: Iceberg: Insert into/overwrite partition support
 Key: HIVE-27627
 URL: https://issues.apache.org/jira/browse/HIVE-27627
 Project: Hive
  Issue Type: New Feature
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Support inserting data in the following query types -
Inserting data via static partition -
{code:java}
INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol = pColValue) VALUES (...);
INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol = pColValue) SELECT 
query;{code}
Inserting data via dynamic partitioning - 
{code:java}
INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol) VALUES (...); 
INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol) SELECT query; {code}
Inserting data via static and dynamic partitioning with static partitioning 
coming at the beginning - 
{code:java}
INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol1 = pColValue, pCol2) 
VALUES (...); 
INSERT INTO|OVERWRITE TABLE tableName PARTITION(pCol1 = pColValue, pCol2) 
SELECT query;{code}
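As a concrete illustration of the mixed static + dynamic case, a hedged sketch (table and column names, including the staging_orders source table, are assumptions for illustration):
{code:java}
-- Hypothetical Iceberg table with two identity partition columns.
CREATE EXTERNAL TABLE orders_example (id INT, amount DOUBLE)
PARTITIONED BY (region STRING, dt STRING)
STORED BY ICEBERG;

-- Static value for region, dynamic values for dt taken from the trailing SELECT column.
INSERT INTO TABLE orders_example PARTITION (region = 'EU', dt)
SELECT id, amount, order_date FROM staging_orders;

-- Overwrite replaces only the partitions produced by the query.
INSERT OVERWRITE TABLE orders_example PARTITION (region = 'EU', dt)
SELECT id, amount, order_date FROM staging_orders;
{code}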



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27536) Merge task must be invoked after optimisation for external CTAS queries

2023-07-28 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27536:
-

 Summary: Merge task must be invoked after optimisation for 
external CTAS queries
 Key: HIVE-27536
 URL: https://issues.apache.org/jira/browse/HIVE-27536
 Project: Hive
  Issue Type: Bug
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Merge task is not invoked on S3 file system / object stores when CTAS query is 
performed. 
Repro test - Test.q
{code:java}
--! qt:dataset:src
set hive.mapred.mode=nonstrict;
set hive.explain.user=false;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.tezfiles=true;
set hive.blobstore.supported.schemes=hdfs,file;
set hive.merge.smallfiles.avgsize=7500;

-- SORT_QUERY_RESULTS

create table part_source(key string, value string) partitioned by (ds string);
create table source(key string);

-- The partitioned table must have 2 files per partition (necessary for merge 
task)
insert overwrite table part_source partition(ds='102') select * from src;
insert into table part_source partition(ds='102') select * from src;
insert overwrite table part_source partition(ds='103') select * from src;
insert into table part_source partition(ds='102') select * from src;

-- The unpartitioned table must have 2 files.
insert overwrite table source select key from src;
insert into table source select key from src;

-- Create CTAS tables both for unpartitioned and partitioned cases for ORC 
formats.
explain analyze create external table ctas_table stored as orc as select * from 
source;
create external table ctas_table stored as orc as select * from source;
explain analyze create external table ctas_part_table partitioned by (ds) 
stored as orc as select * from part_source;
create external table ctas_part_table partitioned by (ds) stored as orc as 
select * from part_source;

-- This must be 1 indicating there is 1 file after merge.
select count(distinct(INPUT__FILE__NAME)) from ctas_table;
-- This must be 2 indicating there is 1 file per partition after merge.
select count(distinct(INPUT__FILE__NAME)) from ctas_part_table;

-- Create CTAS tables both for unpartitioned and partitioned cases for non-ORC 
formats.
explain analyze create external table ctas_table_non_orc as select * from 
source;
create external table ctas_table_non_orc as select * from source;
explain analyze create external table ctas_part_table_non_orc partitioned by 
(ds) as select * from part_source;
create external table ctas_part_table_non_orc partitioned by (ds) as select * 
from part_source;

-- This must be 1 indicating there is 1 file after merge.
select count(distinct(INPUT__FILE__NAME)) from ctas_table_non_orc;
-- This must be 2 indicating there is 1 file per partition after merge.
select count(distinct(INPUT__FILE__NAME)) from ctas_part_table_non_orc;
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27513) Iceberg: Fetch task returns wrong results for Timestamp with local time zone datatype for Iceberg tables

2023-07-25 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17746853#comment-17746853
 ] 

Sourabh Badhya commented on HIVE-27513:
---

Thanks [~ayushtkn] and [~dkuzmenko] for the reviews.

> Iceberg: Fetch task returns wrong results for Timestamp with local time zone 
> datatype for Iceberg tables
> 
>
> Key: HIVE-27513
> URL: https://issues.apache.org/jira/browse/HIVE-27513
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-beta-1
>
>
> Fetch task returns wrong results for Timestamp with local time zone datatype 
> for Iceberg tables
> Repro queries - 
> {code:java}
> create external table ice_ts_4(a int, ts timestamp with local time zone) 
> stored by iceberg;
> insert into ice_ts_4 values (1, current_timestamp());
> set hive.fetch.task.conversion=none;
> select * from ice_ts_4;
> +-+-+
> | ice_ts_4.a  | ice_ts_4.ts |
> +-+-+
> | 1   | 2021-08-16 06:37:30.425 US/Pacific  |
> +-+-+
> set hive.fetch.task.conversion=more;
> select * from ice_ts_4;
> +-++
> | ice_ts_4.a  |ice_ts_4.ts |
> +-++
> | 1   | 2021-08-16 13:37:30.425 Z  |
> +-++ {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27513) Iceberg: Fetch task returns wrong results for Timestamp with local time zone datatype for Iceberg tables

2023-07-19 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27513:
-

 Summary: Iceberg: Fetch task returns wrong results for Timestamp 
with local time zone datatype for Iceberg tables
 Key: HIVE-27513
 URL: https://issues.apache.org/jira/browse/HIVE-27513
 Project: Hive
  Issue Type: Improvement
  Components: Iceberg integration
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Fetch task returns wrong results for Timestamp with local time zone datatype 
for Iceberg tables
Repro queries - 
{code:java}
create external table ice_ts_4(a int, ts timestamp with local time zone) stored 
by iceberg;
insert into ice_ts_4 values (1, current_timestamp());
set hive.fetch.task.conversion=none;
select * from ice_ts_4;
+-+-+
| ice_ts_4.a  | ice_ts_4.ts |
+-+-+
| 1   | 2021-08-16 06:37:30.425 US/Pacific  |
+-+-+

set hive.fetch.task.conversion=more;
select * from ice_ts_4;
+-++
| ice_ts_4.a  |ice_ts_4.ts |
+-++
| 1   | 2021-08-16 13:37:30.425 Z  |
+-++ {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27455) Iceberg: Set COLUMN_STATS_ACCURATE after writing stats for Iceberg tables

2023-07-12 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17742381#comment-17742381
 ] 

Sourabh Badhya commented on HIVE-27455:
---

Thanks [~kkasa] and [~dkuzmenko] for the reviews.

> Iceberg: Set COLUMN_STATS_ACCURATE after writing stats for Iceberg tables
> -
>
> Key: HIVE-27455
> URL: https://issues.apache.org/jira/browse/HIVE-27455
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-beta-1
>
>
> Currently, we are writing the stats to Puffin files but we are not setting 
> COLUMN_STATS_ACCURATE to the desired values.
> The focus of the Jira would be to update the field whenever non-native tables 
> update stats in their format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27452) Fix possible FNFE in HiveQueryLifeTimeHook::checkAndRollbackCTAS

2023-06-21 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735964#comment-17735964
 ] 

Sourabh Badhya commented on HIVE-27452:
---

Thanks [~ayushtkn] and [~dkuzmenko] for the reviews.

> Fix possible FNFE in HiveQueryLifeTimeHook::checkAndRollbackCTAS
> 
>
> Key: HIVE-27452
> URL: https://issues.apache.org/jira/browse/HIVE-27452
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In case of a CTAS rollback, if the table directory is not created at all, 
> then while getting the owner of the table directory we might get a 
> FileNotFoundException.
> Hence, check whether the directory exists before submitting a request for 
> cleanup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27455) Iceberg: Set COLUMN_STATS_ACCURATE after writing stats for Iceberg tables

2023-06-20 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27455:
--
Component/s: Iceberg integration

> Iceberg: Set COLUMN_STATS_ACCURATE after writing stats for Iceberg tables
> -
>
> Key: HIVE-27455
> URL: https://issues.apache.org/jira/browse/HIVE-27455
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Currently, we are writing the stats to Puffin files but we are not setting 
> COLUMN_STATS_ACCURATE to the desired values.
> The focus of the Jira would be to update the field whenever non-native tables 
> update stats in their format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27455) Iceberg: Set COLUMN_STATS_ACCURATE after writing stats for Iceberg tables

2023-06-20 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27455:
-

 Summary: Iceberg: Set COLUMN_STATS_ACCURATE after writing stats 
for Iceberg tables
 Key: HIVE-27455
 URL: https://issues.apache.org/jira/browse/HIVE-27455
 Project: Hive
  Issue Type: Improvement
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Currently, we are writing the stats to Puffin files but we are not setting COLUMN_STATS_ACCURATE to the desired values.
The focus of the Jira would be to update the field whenever non-native tables 
update stats in their format.
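A hedged sketch of how this would typically be exercised (the table name is an assumption; the exact parameter output may differ):
{code:java}
-- Hypothetical Iceberg table; computing column statistics writes them to Puffin files.
CREATE EXTERNAL TABLE stats_example (id INT, name STRING) STORED BY ICEBERG;
INSERT INTO stats_example VALUES (1, 'a'), (2, 'b');

ANALYZE TABLE stats_example COMPUTE STATISTICS FOR COLUMNS;

-- After this change the table parameters should mark the column stats as accurate,
-- e.g. COLUMN_STATS_ACCURATE={"COLUMN_STATS":{"id":"true","name":"true"}}.
DESCRIBE FORMATTED stats_example;
{code}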



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27452) Fix possible FNFE in HiveQueryLifeTimeHook::checkAndRollbackCTAS

2023-06-19 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27452:
-

 Summary: Fix possible FNFE in 
HiveQueryLifeTimeHook::checkAndRollbackCTAS
 Key: HIVE-27452
 URL: https://issues.apache.org/jira/browse/HIVE-27452
 Project: Hive
  Issue Type: Bug
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


In case of a CTAS rollback, if the table directory is not created at all, then 
while getting the owner of the table directory we might get a 
FileNotFoundException.

Hence, check whether the directory exists before submitting a request for cleanup.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27332) Add retry backoff mechanism for abort cleanup

2023-06-09 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17730892#comment-17730892
 ] 

Sourabh Badhya commented on HIVE-27332:
---

Thanks [~veghlaci05] and [~dkuzmenko] for the reviews.

> Add retry backoff mechanism for abort cleanup
> -
>
> Key: HIVE-27332
> URL: https://issues.apache.org/jira/browse/HIVE-27332
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
> directories from aborted transactions without using Initiator & Worker. 
> However, in the event of continuous failures during cleanup, the retry 
> mechanism is initiated every single time. We need to add a retry backoff 
> mechanism to control the time required to initiate a retry again, rather than 
> retrying continuously.
> There are broadly 3 cases wherein retry due to abort cleanup is impacted - 
> *1. Abort cleanup on the table failed + Compaction on the table failed.*
> *2. Abort cleanup on the table failed + Compaction on the table passed*
> *3. Abort cleanup on the table failed + No compaction on the table.*
> *Solution -* 
> *We reuse COMPACTION_QUEUE table to store the retry metadata -* 
> *Advantage: Most of the fields with respect to retry are present in 
> COMPACTION_QUEUE. Hence we can use the same for storing retry metadata. A 
> compaction type called ABORT_CLEANUP ('c') is introduced. The CQ_STATE will 
> remain ready for cleaning for such records.*
> *Actions performed by TaskHandler in the case of failure -* 
> *AbortTxnCleaner -* 
> Action: Just add retry details in the queue table during the abort failure.
> *CompactionCleaner -* 
> Action: If compaction on the same table is successful, delete the retry entry 
> in markCleaned when removing any TXN_COMPONENTS entries except when there are 
> no uncompacted aborts. We do not want to be in a situation where there is a 
> queue entry for a table but there is no record in TXN_COMPONENTS associated 
> with the same table.
> *Advantage: Expecting no performance issues with this approach. Since we 
> delete 1 record most of the times for the associated table/partition.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27332) Add retry backoff mechanism for abort cleanup

2023-06-09 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27332.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Add retry backoff mechanism for abort cleanup
> -
>
> Key: HIVE-27332
> URL: https://issues.apache.org/jira/browse/HIVE-27332
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
> directories from aborted transactions without using Initiator & Worker. 
> However, in the event of continuous failures during cleanup, the retry 
> mechanism is initiated every single time. We need to add a retry backoff 
> mechanism to control the time required to initiate a retry again, rather than 
> retrying continuously.
> There are broadly 3 cases wherein retry due to abort cleanup is impacted - 
> *1. Abort cleanup on the table failed + Compaction on the table failed.*
> *2. Abort cleanup on the table failed + Compaction on the table passed*
> *3. Abort cleanup on the table failed + No compaction on the table.*
> *Solution -* 
> *We reuse COMPACTION_QUEUE table to store the retry metadata -* 
> *Advantage: Most of the fields with respect to retry are present in 
> COMPACTION_QUEUE. Hence we can use the same for storing retry metadata. A 
> compaction type called ABORT_CLEANUP ('c') is introduced. The CQ_STATE will 
> remain ready for cleaning for such records.*
> *Actions performed by TaskHandler in the case of failure -* 
> *AbortTxnCleaner -* 
> Action: Just add retry details in the queue table during the abort failure.
> *CompactionCleaner -* 
> Action: If compaction on the same table is successful, delete the retry entry 
> in markCleaned when removing any TXN_COMPONENTS entries except when there are 
> no uncompacted aborts. We do not want to be in a situation where there is a 
> queue entry for a table but there is no record in TXN_COMPONENTS associated 
> with the same table.
> *Advantage: Expecting no performance issues with this approach. Since we 
> delete 1 record most of the times for the associated table/partition.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27018) Move aborted transaction cleanup outside compaction process

2023-06-09 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27018.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

>  Move aborted transaction cleanup outside compaction process
> 
>
> Key: HIVE-27018
> URL: https://issues.apache.org/jira/browse/HIVE-27018
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
> Fix For: 4.0.0
>
>
> Aborted transactions processing is tightly integrated into the compaction 
> pipeline and consists of 3 main stages: Initiator, Compactor (Worker), 
> Cleaner. That could be simplified by doing all work on the Cleaner side.
> *Potential Benefits -* 
> There are major advantages of implementing this on the cleaner side - 
>  1) Currently an aborted txn in the TXNS table blocks the cleaning of 
> TXN_TO_WRITE_ID table since nothing gets cleaned above MIN(aborted txnid) in 
> the current implementation. After implementing this on the cleaner side, the 
> cleaner regularly checks and cleans the aborted records in the TXN_COMPONENTS 
> table, which in turn makes the AcidTxnCleanerService clean the aborted txns 
> in TXNS table.
>  2) Initiator and worker do not do anything on tables which contain only 
> aborted directories. It's the cleaner which removes the aborted directories 
> of the table. Hence all operations associated with the initiator and worker 
> for these tables are wasteful. These wasteful operations are avoided.
> 3) DP writes which are aborted are skipped by the worker currently. Hence 
> once again the cleaner is the one deleting the aborted directories. All 
> operations associated with the initiator and worker for this entry are 
> wasteful. These wasteful operations are avoided.
> *Proposed solution -* 
> *Implement logic to handle aborted transactions exclusively in Cleaner.*
> Implement logic to fetch the TXN_COMPONENTS which are associated with 
> transactions in aborted state and send the required information to the 
> cleaner. Cleaner must clean up the aborted deltas/delete deltas by using the 
> aborted directories in the AcidState of the table/partition.
> It is also better to separate entities which provide information of 
> compaction and abort cleanup to enhance code modularity. This can be done in 
> this way -
> Cleaner can be divided into separate entities like - 
> *1) Handler* - This entity fetches the data from the metastore DB from 
> relevant tables and converts it into a request entity called CleaningRequest. 
> It would also do SQL operations post cleanup (postprocess). Every type of 
> cleaning request is provided by a separate handler.
> *2) Filesystem remover* - This entity fetches the cleaning requests from 
> various handlers and deletes them according to the cleaning request.
> *This division allows for dynamic extensibility of cleanup from multiple 
> handlers. Every handler is responsible for providing cleaning requests from a 
> specific source.*
> The following solution is resilient i.e. in the event of abrupt metastore 
> shutdown, the cleaner can still see the relevant entries in the metastore DB 
> and retry the cleaning task for that entry.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27421) Do not set column stats in metastore when non-native table can store column stats in its own format

2023-06-08 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27421:
--
Description: 
Non-native table formats like Iceberg have the capability to store column stats in their own format (for Iceberg, they are stored in Puffin files).

However, these stats are stored in the metastore as well after setting the column stats in the table's own format. We must avoid setting column stats in two places and must set them only in a single place.

  was:
Non-native table formats like Iceberg have the capability to store stats in their own format (for Iceberg, they are stored in Puffin files).

However, these stats are stored in the metastore as well after setting the stats in the table's own format. We must avoid setting stats in two places and must set them only in a single place.


> Do not set column stats in metastore when non-native table can store column 
> stats in its own format
> ---
>
> Key: HIVE-27421
> URL: https://issues.apache.org/jira/browse/HIVE-27421
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> Non-native table formats like Iceberg have the capability to store column 
> stats in their own format (for Iceberg, they are stored in Puffin files).
> However, these stats are stored in the metastore as well after setting the column 
> stats in the table's own format. We must avoid setting column stats in two places 
> and must set them only in a single place.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27421) Do not set column stats in metastore when non-native table can store column stats in its own format

2023-06-08 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27421:
--
Summary: Do not set column stats in metastore when non-native table can 
store column stats in its own format  (was: Do not set stats in metastore when 
non-native table can store stats in its own format)

> Do not set column stats in metastore when non-native table can store column 
> stats in its own format
> ---
>
> Key: HIVE-27421
> URL: https://issues.apache.org/jira/browse/HIVE-27421
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> Non-native table formats like Iceberg have the capability to store stats in 
> their own format (for Iceberg, they are stored in Puffin files).
> However, these stats are stored in the metastore as well after setting the stats 
> in the table's own format. We must avoid setting stats in two places and must set 
> them only in a single place.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27421) Do not set stats in metastore when non-native table can store stats in its own format

2023-06-08 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27421:
-

 Summary: Do not set stats in metastore when non-native table can 
store stats in its own format
 Key: HIVE-27421
 URL: https://issues.apache.org/jira/browse/HIVE-27421
 Project: Hive
  Issue Type: Bug
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Non-native table formats like Iceberg have the capability to store stats in their own format (for Iceberg, they are stored in Puffin files).

However, these stats are stored in the metastore as well after setting the stats in the table's own format. We must avoid setting stats in two places and must set them only in a single place.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27408) Parquet file opened for reading stats is never closed

2023-06-06 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17729632#comment-17729632
 ] 

Sourabh Badhya commented on HIVE-27408:
---

Thanks [~ayushtkn] , [~aturoczy] , [~akshatm] for the reviews.

> Parquet file opened for reading stats is never closed
> -
>
> Key: HIVE-27408
> URL: https://issues.apache.org/jira/browse/HIVE-27408
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> ParquetRecordWriterWrapper while closing the writer tries to collect the 
> stats by creating a reader (opening the file). But it never closes the reader 
> (never closes the file). This can leave the file open hence consuming memory 
> and associated file handle resources.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java#L143-L155]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27408) Parquet file opened for reading stats is never closed

2023-06-03 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27408:
--
Description: 
ParquetRecordWriterWrapper while closing the writer tries to collect the stats 
by creating a reader (opening the file). But it never closes the reader (never 
closes the file). This can leave the file open hence consuming memory and 
associated file handle resources.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java#L143-L155]

  was:
ParquetRecordWriterWrapper while closing the writer tries to collect the stats 
by opening a reader. But it never closes the reader. This can leave the file 
open hence consuming memory and associated file handle resources.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java#L143-L155


> Parquet file opened for reading stats is never closed
> -
>
> Key: HIVE-27408
> URL: https://issues.apache.org/jira/browse/HIVE-27408
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> ParquetRecordWriterWrapper while closing the writer tries to collect the 
> stats by creating a reader (opening the file). But it never closes the reader 
> (never closes the file). This can leave the file open hence consuming memory 
> and associated file handle resources.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java#L143-L155]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27408) Parquet file opened for reading stats is never closed

2023-06-03 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27408:
-

 Summary: Parquet file opened for reading stats is never closed
 Key: HIVE-27408
 URL: https://issues.apache.org/jira/browse/HIVE-27408
 Project: Hive
  Issue Type: Bug
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


ParquetRecordWriterWrapper while closing the writer tries to collect the stats 
by opening a reader. But it never closes the reader. This can leave the file 
open hence consuming memory and associated file handle resources.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/ParquetRecordWriterWrapper.java#L143-L155



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27332) Add retry backoff mechanism for abort cleanup

2023-06-01 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27332:
--
Description: 
HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
directories from aborted transactions without using Initiator & Worker. 
However, in the event of continuous failures during cleanup, the retry 
mechanism is initiated every single time. We need to add a retry backoff 
mechanism to control the time required to initiate a retry again, rather than 
retrying continuously.

There are broadly 3 cases wherein retry due to abort cleanup is impacted - 
*1. Abort cleanup on the table failed + Compaction on the table failed.*
*2. Abort cleanup on the table failed + Compaction on the table passed*
*3. Abort cleanup on the table failed + No compaction on the table.*

*Solution -* 

*We reuse COMPACTION_QUEUE table to store the retry metadata -* 

*Advantage: Most of the fields with respect to retry are present in 
COMPACTION_QUEUE. Hence we can use the same for storing retry metadata. A 
compaction type called ABORT_CLEANUP ('c') is introduced. The CQ_STATE will 
remain ready for cleaning for such records.*

*Actions performed by TaskHandler in the case of failure -* 

*AbortTxnCleaner -* 
Action: Just add retry details in the queue table during the abort failure.
*CompactionCleaner -* 
Action: If compaction on the same table is successful, delete the retry entry 
in markCleaned when removing any TXN_COMPONENTS entries except when there are 
no uncompacted aborts. We do not want to be in a situation where there is a 
queue entry for a table but there is no record in TXN_COMPONENTS associated 
with the same table.

*Advantage: Expecting no performance issues with this approach. Since we delete 
1 record most of the times for the associated table/partition.*

  was:
HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
directories from aborted transactions without using Initiator & Worker. 
However, in the event of continuous failures during cleanup, the retry 
mechanism is initiated every single time. We need to add a retry backoff 
mechanism to control the time required to initiate a retry again, rather than 
retrying continuously.

There are broadly 3 cases wherein retry due to abort cleanup is impacted - 
*1. Abort cleanup on the table failed + Compaction on the table failed.*
*2. Abort cleanup on the table failed + Compaction on the table passed*
*3. Abort cleanup on the table failed + No compaction on the table.*

*Solution -* 

*We create a new table called TXN_CLEANUP_QUEUE with following fields to store 
the retry metadata -* 
CREATE TABLE TXN_CLEANUP_QUEUE (
TCQ_DATABASE varchar(128) NOT NULL, 
TCQ_TABLE varchar(256) NOT NULL,
TCQ_PARTITION varchar(767), 
TCQ_RETRY_RETENTION bigint NOT NULL DEFAULT 0, 
TCQ_ERROR_MESSAGE mediumtext in MySQL / clob in derby, oracle DB / text in 
postgres / varchar(max) in mssql DB

);

*Advantage: Separates the flow of metadata. We also eliminate the chance of 
breaking the compaction/abort cleanup when modifying metadata of abort 
cleanup/compaction. Easier debugging in case of failures.*

*Actions performed by TaskHandler in the case of failure -* 

*AbortTxnCleaner -* 
Action: Just add retry details in the queue table during the abort failure.
*CompactionCleaner -* 
Action: If compaction on the same table is successful, delete the retry entry 
in markCleaned when removing any TXN_COMPONENTS entries except when there are 
no uncompacted aborts. We do not want to be in a situation where there is a 
queue entry for a table but there is no record in TXN_COMPONENTS associated 
with the same table.

*Advantage: Expecting no performance issues with this approach. Since we delete 
1 record most of the times for the associated table/partition.*


> Add retry backoff mechanism for abort cleanup
> -
>
> Key: HIVE-27332
> URL: https://issues.apache.org/jira/browse/HIVE-27332
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
> directories from aborted transactions without using Initiator & Worker. 
> However, in the event of continuous failures during cleanup, the retry 
> mechanism is initiated every single time. We need to add a retry backoff 
> mechanism to control the time required to initiate a retry again, rather than 
> retrying continuously.
> There are broadly 3 cases wherein retry due to abort cleanup is impacted - 
> *1. Abort cleanup on the table failed + Compaction on the table failed.*
> *2. Abort cleanup on the table failed + Compaction on the table passed*
> *3. Abort cleanup on the 

[jira] [Commented] (HIVE-27019) Split Cleaner into separate manageable modular entities

2023-05-18 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17723809#comment-17723809
 ] 

Sourabh Badhya commented on HIVE-27019:
---

An addendum PR was merged - [https://github.com/apache/hive/pull/4332]

Thanks for the reviews - [~dkuzmenko] @Attila Turoczy

> Split Cleaner into separate manageable modular entities
> ---
>
> Key: HIVE-27019
> URL: https://issues.apache.org/jira/browse/HIVE-27019
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 19.5h
>  Remaining Estimate: 0h
>
> As described by the parent task - 
> Cleaner can be divided into separate entities like -
> *1) Handler* - This entity fetches the data from the metastore DB from 
> relevant tables and converts it into a request entity called CleaningRequest. 
> It would also do SQL operations post cleanup (postprocess). Every type of 
> cleaning request is provided by a separate handler.
> *2) Filesystem remover* - This entity fetches the cleaning requests from 
> various handlers and deletes them according to the cleaning request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27332) Add retry backoff mechanism for abort cleanup

2023-05-11 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27332:
--
Description: 
HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
directories from aborted transactions without using Initiator & Worker. 
However, in the event of continuous cleanup failures, the retry mechanism is 
initiated every single time. We need to add a retry backoff mechanism that 
controls when the next retry is initiated instead of retrying continuously.

There are broadly 3 cases wherein retry of abort cleanup is impacted - 
*1. Abort cleanup on the table failed + Compaction on the table failed.*
*2. Abort cleanup on the table failed + Compaction on the table passed*
*3. Abort cleanup on the table failed + No compaction on the table.*

*Solution -* 

*We create a new table called TXN_CLEANUP_QUEUE with the following fields to 
store the retry metadata -* 
CREATE TABLE TXN_CLEANUP_QUEUE (
TCQ_DATABASE varchar(128) NOT NULL, 
TCQ_TABLE varchar(256) NOT NULL,
TCQ_PARTITION varchar(767), 
TCQ_RETRY_RETENTION bigint NOT NULL DEFAULT 0, 
TCQ_ERROR_MESSAGE mediumtext -- mediumtext in MySQL; clob in Derby and Oracle; text in Postgres; varchar(max) in MSSQL

);

*Advantage: Separates the two metadata flows, so modifying the metadata of abort 
cleanup cannot break compaction cleanup (and vice versa). It also makes 
debugging failures easier.*

*Actions performed by TaskHandler in the case of failure -* 

*AbortTxnCleaner -* 
Action: Add the retry details to the queue table when abort cleanup fails.
*CompactionCleaner -* 
Action: If compaction on the same table is successful, delete the retry entry 
in markCleaned when removing the TXN_COMPONENTS entries, except when there are 
no uncompacted aborts. We do not want a queue entry for a table when there is 
no record in TXN_COMPONENTS associated with that table.

*Advantage: No performance impact is expected from this approach, since most of 
the time only one record is deleted for the associated table/partition.*

  was:
HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
directories from aborted transactions without using Initiator & Worker. 
However, during the event of continuous failure during cleanup, the retry 
mechanism is initiated every single time. We need to add retry backoff 
mechanism to control the time required to initiate retry again and not 
continuously retry.

There are widely 3 cases wherein retry due to abort cleanup is impacted - 
*1. Abort cleanup on the table failed + Compaction on the table failed.*
*2. Abort cleanup on the table failed + Compaction on the table passed*
*3. Abort cleanup on the table failed + No compaction on the table.*

*Solution -* 

*We create a new table called TXN_CLEANUP_QUEUE with following fields to store 
the retry metadata -* 
CREATE TABLE TXN_CLEANUP_QUEUE (
TCQ_DATABASE varchar(128) NOT NULL, 
TCQ_TABLE varchar(256) NOT NULL,
TCQ_PARTITION varchar(767), 
TCQ_RETRY_RETENTION bigint NOT NULL DEFAULT 0, 
TCQ_ERROR_MESSAGE mediumtext in MySQL / clob in derby, oracle DB / text in 
postgres / varchar(max) in mssql DB

);

*Advantage: Separates the flow of metadata. We also eliminate the chance of 
breaking the compaction/abort cleanup when modifying metadata of abort 
cleanup/compaction. Easier debugging in case of failures.*

*Actions performed by TaskHandler in the case of failure -* 
**

*AbortTxnCleaner -* 
Action: Just add retry details in the queue table during the abort failure.
{*}CompactionCleaner -{*} 
Action: If compaction on the same table is successful, delete the retry entry 
in markCleaned when removing any TXN_COMPONENTS entries except when there are 
no uncompacted aborts. We do not want to be in a situation where there is a 
queue entry for a table but there is no record in TXN_COMPONENTS associated 
with the same table.

{*}Advantage: Expecting no performance issues with this approach. Since we 
delete 1 record most of the times for the associated table/partition.{*}


> Add retry backoff mechanism for abort cleanup
> -
>
> Key: HIVE-27332
> URL: https://issues.apache.org/jira/browse/HIVE-27332
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
> directories from aborted transactions without using Initiator & Worker. 
> However, during the event of continuous failure during cleanup, the retry 
> mechanism is initiated every single time. We need to add retry backoff 
> mechanism to control the time required to initiate retry again and not 
> continuously retry.
> There are widely 3 cases wherein retry due to abort cleanup is 

[jira] [Created] (HIVE-27332) Add retry backoff mechanism for abort cleanup

2023-05-11 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27332:
-

 Summary: Add retry backoff mechanism for abort cleanup
 Key: HIVE-27332
 URL: https://issues.apache.org/jira/browse/HIVE-27332
 Project: Hive
  Issue Type: Sub-task
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


HIVE-27019 and HIVE-27020 added the functionality to directly clean data 
directories from aborted transactions without using Initiator & Worker. 
However, in the event of continuous cleanup failures, the retry mechanism is 
initiated every single time. We need to add a retry backoff mechanism that 
controls when the next retry is initiated instead of retrying continuously.

There are broadly 3 cases wherein retry of abort cleanup is impacted - 
*1. Abort cleanup on the table failed + Compaction on the table failed.*
*2. Abort cleanup on the table failed + Compaction on the table passed*
*3. Abort cleanup on the table failed + No compaction on the table.*

*Solution -* 

*We create a new table called TXN_CLEANUP_QUEUE with the following fields to 
store the retry metadata -* 
CREATE TABLE TXN_CLEANUP_QUEUE (
TCQ_DATABASE varchar(128) NOT NULL, 
TCQ_TABLE varchar(256) NOT NULL,
TCQ_PARTITION varchar(767), 
TCQ_RETRY_RETENTION bigint NOT NULL DEFAULT 0, 
TCQ_ERROR_MESSAGE mediumtext -- mediumtext in MySQL; clob in Derby and Oracle; text in Postgres; varchar(max) in MSSQL

);

*Advantage: Separates the two metadata flows, so modifying the metadata of abort 
cleanup cannot break compaction cleanup (and vice versa). It also makes 
debugging failures easier.*

*Actions performed by TaskHandler in the case of failure -* 

*AbortTxnCleaner -* 
Action: Add the retry details to the queue table when abort cleanup fails.
*CompactionCleaner -* 
Action: If compaction on the same table is successful, delete the retry entry 
in markCleaned when removing the TXN_COMPONENTS entries, except when there are 
no uncompacted aborts. We do not want a queue entry for a table when there is 
no record in TXN_COMPONENTS associated with that table.

*Advantage: No performance impact is expected from this approach, since most of 
the time only one record is deleted for the associated table/partition.*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27020) Implement a separate handler to handle aborted transaction cleanup

2023-04-24 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27020:
--
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~veghlaci05] , [~akshatm] , [~dkuzmenko] for the reviews.

> Implement a separate handler to handle aborted transaction cleanup
> --
>
> Key: HIVE-27020
> URL: https://issues.apache.org/jira/browse/HIVE-27020
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 17h 20m
>  Remaining Estimate: 0h
>
> As described in the parent task, once the cleaner is separated into different 
> entities, implement a separate handler which can create requests for aborted 
> transactions cleanup. This would move the aborted transaction cleanup 
> exclusively to the cleaner.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27270) Remove the code which creates compaction request for aborts

2023-04-19 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27270:
--
Description: Remove the code which creates compaction request for aborts. 
The purpose of this task is to remove the code once it is determined that the 
abort cleanup feature is stable.  (was: Remove the code which creates 
compaction request for aborts in Initiator. The purpose of this task is to 
remove the code once it is determined that the abort cleanup feature is stable.)

> Remove the code which creates compaction request for aborts
> ---
>
> Key: HIVE-27270
> URL: https://issues.apache.org/jira/browse/HIVE-27270
> Project: Hive
>  Issue Type: Task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Remove the code which creates compaction request for aborts. The purpose of 
> this task is to remove the code once it is determined that the abort cleanup 
> feature is stable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27270) Remove the code which creates compaction request for aborts

2023-04-19 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27270:
--
Summary: Remove the code which creates compaction request for aborts  (was: 
Remove the code which creates compaction request for aborts in Initiator)

> Remove the code which creates compaction request for aborts
> ---
>
> Key: HIVE-27270
> URL: https://issues.apache.org/jira/browse/HIVE-27270
> Project: Hive
>  Issue Type: Task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Remove the code which creates compaction request for aborts in Initiator. The 
> purpose of this task is to remove the code once it is determined that the 
> abort cleanup feature is stable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27270) Remove the code which creates compaction request for aborts in Initiator

2023-04-19 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27270:
-

 Summary: Remove the code which creates compaction request for 
aborts in Initiator
 Key: HIVE-27270
 URL: https://issues.apache.org/jira/browse/HIVE-27270
 Project: Hive
  Issue Type: Task
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Remove the code which creates compaction request for aborts in Initiator. The 
purpose of this task is to remove the code once it is determined that the abort 
cleanup feature is stable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27267) Incorrect results when doing bucket map join on decimal bucketed column with subquery

2023-04-17 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27267:
--
Description: 
The following queries when run on a Hive cluster produce no results - 
Repro queries - 
{code:java}
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
set hive.convert.join.bucket.mapjoin.tez=true;

drop table if exists test_external_source;
create external table test_external_source (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_source values ('2022-08-30', 'pipeline', 
'5005905545593'), ('2022-08-16', 'pipeline', 
'5005905545593'), ('2022-09-01', 'pipeline', 
'5006008686831'), ('2022-08-30', 'pipeline', 
'5005992620837'), ('2022-09-01', 'pipeline', 
'5005992620837'), ('2022-09-01', 'pipeline', 
'5005992621067'), ('2022-08-30', 'pipeline', 
'5005992621067');

drop table if exists test_external_target;
create external table test_external_target (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_target values ('2017-05-17', 'pipeline', 
'5000441610525'), ('2018-12-20', 'pipeline', 
'5001048981030'), ('2020-06-30', 'pipeline', 
'5002332575516'), ('2021-08-16', 'pipeline', 
'5003897973989'), ('2017-06-06', 'pipeline', 
'5000449148729'), ('2017-09-08', 'pipeline', 
'5000525378314'), ('2022-08-30', 'pipeline', 
'5005905545593'), ('2022-08-16', 'pipeline', 
'5005905545593'), ('2018-05-03', 'pipeline', 
'5000750826355'), ('2020-01-10', 'pipeline', 
'5001816579677'), ('2021-11-01', 'pipeline', 
'5004269423714'), ('2017-11-07', 'pipeline', 
'5000585901787'), ('2019-10-15', 'pipeline', 
'5001598843430'), ('2020-04-01', 'pipeline', 
'5002035795461'), ('2020-02-24', 'pipeline', 
'5001932600185'), ('2020-04-27', 'pipeline', 
'5002108160849'), ('2016-07-05', 'pipeline', 
'554405114'), ('2020-06-02', 'pipeline', 
'5002234387967'), ('2020-08-21', 'pipeline', 
'5002529168758'), ('2021-02-17', 'pipeline', 
'5003158511687');

drop table if exists target_table;
drop table if exists source_table;
create table target_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');
create table source_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');

insert into table target_table select * from test_external_target;
insert into table source_table select * from test_external_source; {code}
Query which is under investigation - 
{code:java}
select * from target_table inner join (select distinct date_col, 'pipeline' 
string_col, decimal_col from source_table where coalesce(decimal_col,'') = 
'5005905545593') s on s.date_col = target_table.date_col AND 
s.string_col = target_table.string_col AND s.decimal_col = 
target_table.decimal_col; {code}
Expected result of the query - 2 records
{code:java}
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| target_table.date_col  | target_table.string_col  | target_table.decimal_col  | s.date_col  | s.string_col  | s.decimal_col  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| 2022-08-16             | pipeline                 | 5005905545593             | 2022-08-16  | pipeline      | 5005905545593  |
| 2022-08-30             | pipeline                 | 5005905545593             | 2022-08-30  | pipeline      | 5005905545593  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
 {code}
Actual result of the query - No records
{code:java}
++--+---+-+---++
| target_table.date_col  | target_table.string_col  | target_table.decimal_col  
| s.date_col  | 

[jira] [Created] (HIVE-27267) Incorrect results when doing bucket map join on decimal bucketed column with subquery

2023-04-17 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-27267:
-

 Summary: Incorrect results when doing bucket map join on decimal 
bucketed column with subquery
 Key: HIVE-27267
 URL: https://issues.apache.org/jira/browse/HIVE-27267
 Project: Hive
  Issue Type: Bug
Reporter: Sourabh Badhya


The following queries when run on a Hive cluster produce no results - 
Repro queries - 
{code:java}
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
set hive.convert.join.bucket.mapjoin.tez=true;

drop table if exists test_external_source;
create external table test_external_source (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_source values ('2022-08-30', 'pipeline', 
'5005905545593'), ('2022-08-16', 'pipeline', 
'5005905545593'), ('2022-09-01', 'pipeline', 
'5006008686831'), ('2022-08-30', 'pipeline', 
'5005992620837'), ('2022-09-01', 'pipeline', 
'5005992620837'), ('2022-09-01', 'pipeline', 
'5005992621067'), ('2022-08-30', 'pipeline', 
'5005992621067');

drop table if exists test_external_target;
create external table test_external_target (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_target values ('2017-05-17', 'pipeline', 
'5000441610525'), ('2018-12-20', 'pipeline', 
'5001048981030'), ('2020-06-30', 'pipeline', 
'5002332575516'), ('2021-08-16', 'pipeline', 
'5003897973989'), ('2017-06-06', 'pipeline', 
'5000449148729'), ('2017-09-08', 'pipeline', 
'5000525378314'), ('2022-08-30', 'pipeline', 
'5005905545593'), ('2022-08-16', 'pipeline', 
'5005905545593'), ('2018-05-03', 'pipeline', 
'5000750826355'), ('2020-01-10', 'pipeline', 
'5001816579677'), ('2021-11-01', 'pipeline', 
'5004269423714'), ('2017-11-07', 'pipeline', 
'5000585901787'), ('2019-10-15', 'pipeline', 
'5001598843430'), ('2020-04-01', 'pipeline', 
'5002035795461'), ('2020-02-24', 'pipeline', 
'5001932600185'), ('2020-04-27', 'pipeline', 
'5002108160849'), ('2016-07-05', 'pipeline', 
'554405114'), ('2020-06-02', 'pipeline', 
'5002234387967'), ('2020-08-21', 'pipeline', 
'5002529168758'), ('2021-02-17', 'pipeline', 
'5003158511687');

drop table if exists target_table;
drop table if exists source_table;
create table target_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');
create table source_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');

insert into table target_table select * from test_external_target;
insert into table source_table select * from test_external_source; {code}
Query which is under investigation - 
{code:java}
select * from target_table inner join (select distinct date_col, 'pipeline' 
string_col, decimal_col from source_table where coalesce(decimal_col,'') = 
'5005905545593') s on s.date_col = target_table.date_col AND 
s.string_col = target_table.string_col AND s.decimal_col = 
target_table.decimal_col; {code}
Expected result of the query - 2 records
{code:java}
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| target_table.date_col  | target_table.string_col  | target_table.decimal_col  | s.date_col  | s.string_col  | s.decimal_col  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| 2022-08-16             | pipeline                 | 5005905545593             | 2022-08-16  | pipeline      | 5005905545593  |
| 2022-08-30             | pipeline                 | 5005905545593             | 2022-08-30  | pipeline      | 5005905545593  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
 {code}
Actual result of the query - No records
{code:java}

[jira] [Updated] (HIVE-27267) Incorrect results when doing bucket map join on decimal bucketed column with subquery

2023-04-17 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27267:
--
Description: 
The following queries when run on a Hive cluster produce no results - 
Repro queries - 
{code:java}
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.support.concurrency=true;
set hive.convert.join.bucket.mapjoin.tez=true;

drop table if exists test_external_source;
create external table test_external_source (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_source values ('2022-08-30', 'pipeline', 
'5005905545593'), ('2022-08-16', 'pipeline', 
'5005905545593'), ('2022-09-01', 'pipeline', 
'5006008686831'), ('2022-08-30', 'pipeline', 
'5005992620837'), ('2022-09-01', 'pipeline', 
'5005992620837'), ('2022-09-01', 'pipeline', 
'5005992621067'), ('2022-08-30', 'pipeline', 
'5005992621067');

drop table if exists test_external_target;
create external table test_external_target (date_col date, string_col string, 
decimal_col decimal(38,0)) stored as orc tblproperties 
('external.table.purge'='true');
insert into table test_external_target values ('2017-05-17', 'pipeline', 
'5000441610525'), ('2018-12-20', 'pipeline', 
'5001048981030'), ('2020-06-30', 'pipeline', 
'5002332575516'), ('2021-08-16', 'pipeline', 
'5003897973989'), ('2017-06-06', 'pipeline', 
'5000449148729'), ('2017-09-08', 'pipeline', 
'5000525378314'), ('2022-08-30', 'pipeline', 
'5005905545593'), ('2022-08-16', 'pipeline', 
'5005905545593'), ('2018-05-03', 'pipeline', 
'5000750826355'), ('2020-01-10', 'pipeline', 
'5001816579677'), ('2021-11-01', 'pipeline', 
'5004269423714'), ('2017-11-07', 'pipeline', 
'5000585901787'), ('2019-10-15', 'pipeline', 
'5001598843430'), ('2020-04-01', 'pipeline', 
'5002035795461'), ('2020-02-24', 'pipeline', 
'5001932600185'), ('2020-04-27', 'pipeline', 
'5002108160849'), ('2016-07-05', 'pipeline', 
'554405114'), ('2020-06-02', 'pipeline', 
'5002234387967'), ('2020-08-21', 'pipeline', 
'5002529168758'), ('2021-02-17', 'pipeline', 
'5003158511687');

drop table if exists target_table;
drop table if exists source_table;
create table target_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');
create table source_table(date_col date, string_col string, decimal_col 
decimal(38,0)) clustered by (decimal_col) into 7 buckets stored as orc 
tblproperties ('bucketing_version'='2', 'transactional'='true', 
'transactional_properties'='default');

insert into table target_table select * from test_external_target;
insert into table source_table select * from test_external_source; {code}
Query which is under investigation - 
{code:java}
select * from target_table inner join (select distinct date_col, 'pipeline' 
string_col, decimal_col from source_table where coalesce(decimal_col,'') = 
'5005905545593') s on s.date_col = target_table.date_col AND 
s.string_col = target_table.string_col AND s.decimal_col = 
target_table.decimal_col; {code}
Expected result of the query - 2 records
{code:java}
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| target_table.date_col  | target_table.string_col  | target_table.decimal_col  | s.date_col  | s.string_col  | s.decimal_col  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
| 2022-08-16             | pipeline                 | 5005905545593             | 2022-08-16  | pipeline      | 5005905545593  |
| 2022-08-30             | pipeline                 | 5005905545593             | 2022-08-30  | pipeline      | 5005905545593  |
+------------------------+--------------------------+---------------------------+-------------+---------------+----------------+
 {code}
Actual result of the query - No records
{code:java}
++--+---+-+---++
| target_table.date_col  | target_table.string_col  | target_table.decimal_col  
| s.date_col  | 

[jira] [Commented] (HIVE-27228) Add missing upgrade SQL statements after CQ_NUMBER_OF_BUCKETS column being introduced in HIVE-26719

2023-04-11 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17710438#comment-17710438
 ] 

Sourabh Badhya commented on HIVE-27228:
---

Thanks [~veghlaci05] and [~scarlin] for the reviews.

> Add missing upgrade SQL statements after CQ_NUMBER_OF_BUCKETS column being 
> introduced in HIVE-26719
> ---
>
> Key: HIVE-27228
> URL: https://issues.apache.org/jira/browse/HIVE-27228
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> HIVE-26719 introduced CQ_NUMBER_OF_BUCKETS column in COMPACTION_QUEUE table 
> and COMPLETED_COMPACTIONS table. However, the corresponding upgrade SQL 
> statements are missing for these columns. Also, CQ_NUMBER_OF_BUCKETS is not 
> reflected in the COMPACTIONS view in the information schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-27228) Add missing upgrade SQL statements after CQ_NUMBER_OF_BUCKETS column being introduced in HIVE-26719

2023-04-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya reassigned HIVE-27228:
-


> Add missing upgrade SQL statements after CQ_NUMBER_OF_BUCKETS column being 
> introduced in HIVE-26719
> ---
>
> Key: HIVE-27228
> URL: https://issues.apache.org/jira/browse/HIVE-27228
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> HIVE-26719 introduced CQ_NUMBER_OF_BUCKETS column in COMPACTION_QUEUE table 
> and COMPLETED_COMPACTIONS table. However, the corresponding upgrade SQL 
> statements are missing for these columns. Also, CQ_NUMBER_OF_BUCKETS is not 
> reflected in the COMPACTIONS view in the information schema.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27168) Use basename of the datatype when fetching partition metadata using partition filters

2023-03-27 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705212#comment-17705212
 ] 

Sourabh Badhya commented on HIVE-27168:
---

Thanks [~kokila19] , [~rkirtir] , [~akshatm] , [~InvisibleProgrammer] , 
[~veghlaci05] , [~dkuzmenko] for the reviews.

> Use basename of the datatype when fetching partition metadata using partition 
> filters
> -
>
> Key: HIVE-27168
> URL: https://issues.apache.org/jira/browse/HIVE-27168
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> While fetching partition metadata using partition filters, we use the column 
> type of the table directly. However, char/varchar types can carry extra 
> information, such as the length of the column, and this extra information 
> causes the partition metadata fetch to be skipped.
> Solution: Use the basename of the column type when deciding whether 
> partition pruning can be done on the partitioned column.
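
As an illustration of the fix (a simplified stand-in, not the actual metastore 
code), the basename is the type name with any length/precision parameters 
stripped, so "varchar(64)" is treated as "varchar" when deciding whether the 
partition filter can be pushed down:
{code:java}
// Simplified sketch of deriving the basename of a column type.
public final class TypeBaseNameSketch {

  static String baseName(String columnType) {
    int paren = columnType.indexOf('(');
    String base = (paren == -1) ? columnType : columnType.substring(0, paren);
    return base.trim().toLowerCase();
  }

  public static void main(String[] args) {
    System.out.println(baseName("varchar(64)"));   // varchar
    System.out.println(baseName("char(10)"));      // char
    System.out.println(baseName("decimal(38,0)")); // decimal
    System.out.println(baseName("string"));        // string
  }
}
{code}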



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27168) Use basename of the datatype when fetching partition metadata using partition filters

2023-03-23 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27168:
--
Status: Patch Available  (was: Open)

> Use basename of the datatype when fetching partition metadata using partition 
> filters
> -
>
> Key: HIVE-27168
> URL: https://issues.apache.org/jira/browse/HIVE-27168
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While fetching partition metadata using partition filters, we use the column 
> type of the table directly. However, char/varchar types can carry extra 
> information, such as the length of the column, and this extra information 
> causes the partition metadata fetch to be skipped.
> Solution: Use the basename of the column type when deciding whether 
> partition pruning can be done on the partitioned column.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-27168) Use basename of the datatype when fetching partition metadata using partition filters

2023-03-23 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya reassigned HIVE-27168:
-


> Use basename of the datatype when fetching partition metadata using partition 
> filters
> -
>
> Key: HIVE-27168
> URL: https://issues.apache.org/jira/browse/HIVE-27168
> Project: Hive
>  Issue Type: Bug
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> While fetching partition metadata using partition filters, we use the column 
> type of the table directly. However, char/varchar types can carry extra 
> information, such as the length of the column, and this extra information 
> causes the partition metadata fetch to be skipped.
> Solution: Use the basename of the column type when deciding whether 
> partition pruning can be done on the partitioned column.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27020) Implement a separate handler to handle aborted transaction cleanup

2023-03-13 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27020:
--
Status: Patch Available  (was: Open)

> Implement a separate handler to handle aborted transaction cleanup
> --
>
> Key: HIVE-27020
> URL: https://issues.apache.org/jira/browse/HIVE-27020
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> As described in the parent task, once the cleaner is separated into different 
> entities, implement a separate handler which can create requests for aborted 
> transactions cleanup. This would move the aborted transaction cleanup 
> exclusively to the cleaner.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27122) Use Caffeine for caching metadata objects in Compactor threads

2023-03-08 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697789#comment-17697789
 ] 

Sourabh Badhya commented on HIVE-27122:
---

Thanks [~akshatm] , [~kokila19] , [~rkirtir] , [~veghlaci05] for the reviews.

> Use Caffeine for caching metadata objects in Compactor threads
> --
>
> Key: HIVE-27122
> URL: https://issues.apache.org/jira/browse/HIVE-27122
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Currently, compactor threads make use of the Guava package to cache metadata 
> objects like database/table objects. We should consider using the Caffeine 
> package since it provides more control over the cache. It has also been 
> observed that caches created with Caffeine are more performant than caches 
> created with Guava.
> Some benchmarks comparing the Caffeine and Guava packages - 
> [https://github.com/ben-manes/caffeine/wiki/Benchmarks]
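
A minimal sketch of such a cache (the class name, key format and bounds below 
are illustrative assumptions, not the actual compactor code):
{code:java}
// Minimal Caffeine cache sketch for metastore Table objects, keyed by "db.table".
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import org.apache.hadoop.hive.metastore.api.Table;

public final class TableCacheSketch {

  private final Cache<String, Table> tableCache = Caffeine.newBuilder()
      .maximumSize(1_000)                      // bound the number of cached tables
      .expireAfterWrite(2, TimeUnit.MINUTES)   // drop stale metadata after a while
      .build();

  Table getTable(String qualifiedName, Function<String, Table> loader) {
    // Loads on miss; Caffeine handles concurrency and eviction.
    return tableCache.get(qualifiedName, loader);
  }
}
{code}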



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27122) Use Caffeine for caching metadata objects in Compactor threads

2023-03-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27122:
--
Status: Patch Available  (was: Open)

> Use Caffeine for caching metadata objects in Compactor threads
> --
>
> Key: HIVE-27122
> URL: https://issues.apache.org/jira/browse/HIVE-27122
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, compactor threads make use of the Guava package to cache metadata 
> objects like database/table objects. We should consider using the Caffeine 
> package since it provides more control over the cache. It has also been 
> observed that caches created with Caffeine are more performant than caches 
> created with Guava.
> Some benchmarks comparing the Caffeine and Guava packages - 
> [https://github.com/ben-manes/caffeine/wiki/Benchmarks]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-27122) Use Caffeine for caching metadata objects in Compactor threads

2023-03-06 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya reassigned HIVE-27122:
-


> Use Caffeine for caching metadata objects in Compactor threads
> --
>
> Key: HIVE-27122
> URL: https://issues.apache.org/jira/browse/HIVE-27122
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Currently, compactor threads make use of the Guava package to cache metadata 
> objects like database/table objects. We should consider using the Caffeine 
> package since it provides more control over the cache. It has also been 
> observed that caches created with Caffeine are more performant than caches 
> created with Guava.
> Some benchmarks comparing the Caffeine and Guava packages - 
> [https://github.com/ben-manes/caffeine/wiki/Benchmarks]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27019) Split Cleaner into separate manageable modular entities

2023-03-03 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696170#comment-17696170
 ] 

Sourabh Badhya commented on HIVE-27019:
---

Thanks [~veghlaci05] , [~dkuzmenko] for the reviews.

> Split Cleaner into separate manageable modular entities
> ---
>
> Key: HIVE-27019
> URL: https://issues.apache.org/jira/browse/HIVE-27019
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 19.5h
>  Remaining Estimate: 0h
>
> As described by the parent task - 
> Cleaner can be divided into separate entities like -
> *1) Handler* - This entity fetches the data from the metastore DB from 
> relevant tables and converts it into a request entity called CleaningRequest. 
> It would also do SQL operations post cleanup (postprocess). Every type of 
> cleaning request is provided by a separate handler.
> *2) Filesystem remover* - This entity fetches the cleaning requests from 
> various handlers and deletes them according to the cleaning request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27019) Split Cleaner into separate manageable modular entities

2023-02-28 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya updated HIVE-27019:
--
Status: Patch Available  (was: In Progress)

> Split Cleaner into separate manageable modular entities
> ---
>
> Key: HIVE-27019
> URL: https://issues.apache.org/jira/browse/HIVE-27019
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 12h 40m
>  Remaining Estimate: 0h
>
> As described by the parent task - 
> Cleaner can be divided into separate entities like -
> *1) Handler* - This entity fetches the data from the metastore DB from 
> relevant tables and converts it into a request entity called CleaningRequest. 
> It would also do SQL operations post cleanup (postprocess). Every type of 
> cleaning request is provided by a separate handler.
> *2) Filesystem remover* - This entity fetches the cleaning requests from 
> various handlers and deletes them according to the cleaning request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work started] (HIVE-27018) Move aborted transaction cleanup outside compaction process

2023-02-02 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-27018 started by Sourabh Badhya.
-
>  Move aborted transaction cleanup outside compaction process
> 
>
> Key: HIVE-27018
> URL: https://issues.apache.org/jira/browse/HIVE-27018
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>
> Aborted transactions processing is tightly integrated into the compaction 
> pipeline and consists of 3 main stages: Initiator, Compactor (Worker), 
> Cleaner. That could be simplified by doing all work on the Cleaner side.
> *Potential Benefits -* 
> There are major advantages of implementing this on the cleaner side - 
>  1) Currently an aborted txn in the TXNS table blocks the cleaning of 
> TXN_TO_WRITE_ID table since nothing gets cleaned above MIN(aborted txnid) in 
> the current implementation. After implementing this on the cleaner side, the 
> cleaner regularly checks and cleans the aborted records in the TXN_COMPONENTS 
> table, which in turn makes the AcidTxnCleanerService clean the aborted txns 
> in TXNS table.
>  2) Initiator and worker do not do anything on tables which contain only 
> aborted directories. It's the cleaner which removes the aborted directories 
> of the table. Hence all operations associated with the initiator and worker 
> for these tables are wasteful. These wasteful operations are avoided.
> 3) DP writes which are aborted are skipped by the worker currently. Hence 
> once again the cleaner is the one deleting the aborted directories. All 
> operations associated with the initiator and worker for this entry are 
> wasteful. These wasteful operations are avoided.
> *Proposed solution -* 
> *Implement logic to handle aborted transactions exclusively in Cleaner.*
> Implement logic to fetch the TXN_COMPONENTS which are associated with 
> transactions in aborted state and send the required information to the 
> cleaner. Cleaner must clean up the aborted deltas/delete deltas by using the 
> aborted directories in the AcidState of the table/partition.
> It is also better to separate entities which provide information of 
> compaction and abort cleanup to enhance code modularity. This can be done in 
> this way -
> Cleaner can be divided into separate entities like - 
> *1) Handler* - This entity fetches the data from the metastore DB from 
> relevant tables and converts it into a request entity called CleaningRequest. 
> It would also do SQL operations post cleanup (postprocess). Every type of 
> cleaning request is provided by a separate handler.
> *2) Filesystem remover* - This entity fetches the cleaning requests from 
> various handlers and deletes them according to the cleaning request.
> *This division allows for dynamic extensibility of cleanup from multiple 
> handlers. Every handler is responsible for providing cleaning requests from a 
> specific source.*
> The following solution is resilient i.e. in the event of abrupt metastore 
> shutdown, the cleaner can still see the relevant entries in the metastore DB 
> and retry the cleaning task for that entry.
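
A bare-bones sketch of that split (illustrative only; the real Hive interfaces 
differ in detail):
{code:java}
// Illustrative sketch of the handler / filesystem-remover split described above.
import java.util.List;

/** Simplified stand-in for the real cleaning request type. */
record CleaningRequest(String dbName, String tableName, List<String> pathsToDelete) {}

/** Each handler builds cleaning requests from one source (compaction queue, aborted txns, ...). */
interface Handler {
  List<CleaningRequest> getTasks();            // read metastore state, build requests
  void postprocess(CleaningRequest request);   // SQL bookkeeping after a successful delete
}

/** Single component that executes the filesystem deletions for all handlers. */
final class FilesystemRemover {
  void clean(List<Handler> handlers) {
    for (Handler handler : handlers) {
      for (CleaningRequest request : handler.getTasks()) {
        request.pathsToDelete().forEach(path -> {
          // delete 'path' from the filesystem here (omitted in this sketch)
        });
        handler.postprocess(request);
      }
    }
  }
}
{code}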



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

