[jira] [Updated] (HIVE-27980) Hive Iceberg Compaction: add support for OPTIMIZE TABLE syntax

2024-05-24 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-27980:
--
Description: 
Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
{code:java}
ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
Add support for OPTIMIZE TABLE syntax. Example:
{code:java}
OPTIMIZE TABLE name REWRITE DATA

  future options support --- 
[USING BIN_PACK]
[ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
WHERE category = 'c1' {code}
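For illustration, a fully-specified statement under the proposed grammar could look like the following (the option values are made up for this example and are not part of the proposal):

```sql
OPTIMIZE TABLE orders REWRITE DATA
USING BIN_PACK (FILE_SIZE_THRESHOLD = '128MB', MIN_INPUT_FILES = '5')
WHERE category = 'c1';
```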
This syntax will be in line with Impala's.

Also, the OPTIMIZE command is not limited to compaction; it also supports other 
table maintenance operations.

 

  was:
Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
{code:java}
ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
Add support for OPTIMIZE TABLE syntax. Example:
{code:java}
OPTIMIZE TABLE name REWRITE DATA
  future options support --- 
[USING BIN_PACK]
[ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
WHERE category = 'c1' {code}
This syntax will be in line with Impala's.

Also, the OPTIMIZE command is not limited to compaction; it also supports other 
table maintenance operations.

 


> Hive Iceberg Compaction: add support for OPTIMIZE TABLE syntax
> --
>
> Key: HIVE-27980
> URL: https://issues.apache.org/jira/browse/HIVE-27980
> Project: Hive
>  Issue Type: New Feature
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
> {code:java}
> ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
> Add support for OPTIMIZE TABLE syntax. Example:
> {code:java}
> OPTIMIZE TABLE name REWRITE DATA
>   future options support --- 
> [USING BIN_PACK]
> [ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
> WHERE category = 'c1' {code}
> This syntax will be in line with Impala's.
> Also, the OPTIMIZE command is not limited to compaction; it also supports other 
> table maintenance operations.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27980) Hive Iceberg Compaction: add support for OPTIMIZE TABLE syntax

2024-05-24 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-27980:
--
Description: 
Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
{code:java}
ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
Add support for OPTIMIZE TABLE syntax. Example:
{code:java}
OPTIMIZE TABLE name REWRITE DATA
  future options support --- 
[USING BIN_PACK]
[ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
WHERE category = 'c1' {code}
This syntax will be in line with Impala's.

Also, the OPTIMIZE command is not limited to compaction; it also supports other 
table maintenance operations.

 

  was:
Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
{code:java}
ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
Add support for OPTIMIZE TABLE syntax. Example:
{code:java}
OPTIMIZE TABLE name REWRITE DATA
  future --- 
[USING BIN_PACK]
[ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
WHERE category = 'c1' {code}
This syntax will be in line with Impala's.

Also, the OPTIMIZE command is not limited to compaction; it also supports other 
table maintenance operations.

 


> Hive Iceberg Compaction: add support for OPTIMIZE TABLE syntax
> --
>
> Key: HIVE-27980
> URL: https://issues.apache.org/jira/browse/HIVE-27980
> Project: Hive
>  Issue Type: New Feature
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
> {code:java}
> ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
> Add support for OPTIMIZE TABLE syntax. Example:
> {code:java}
> OPTIMIZE TABLE name REWRITE DATA
>   future options support --- 
> [USING BIN_PACK]
> [ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
> WHERE category = 'c1' {code}
> This syntax will be in line with Impala's.
> Also, the OPTIMIZE command is not limited to compaction; it also supports other 
> table maintenance operations.
>  





[jira] [Updated] (HIVE-27980) Hive Iceberg Compaction: add support for OPTIMIZE TABLE syntax

2024-05-24 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-27980:
--
Description: 
Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
{code:java}
ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
Add support for OPTIMIZE TABLE syntax. Example:
{code:java}
OPTIMIZE TABLE name REWRITE DATA
  future --- 
[USING BIN_PACK]
[ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
WHERE category = 'c1' {code}
This syntax will be in line with Impala's.

Also, the OPTIMIZE command is not limited to compaction; it also supports other 
table maintenance operations.

 

  was:
Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
{code:java}
ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
Add support for OPTIMIZE TABLE syntax. Example:
{code:java}
OPTIMIZE TABLE name
REWRITE DATA [USING BIN_PACK]
[ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
WHERE category = 'c1' {code}
This syntax will be in line with Impala's.

Also, the OPTIMIZE command is not limited to compaction; it also supports other 
table maintenance operations.

 


> Hive Iceberg Compaction: add support for OPTIMIZE TABLE syntax
> --
>
> Key: HIVE-27980
> URL: https://issues.apache.org/jira/browse/HIVE-27980
> Project: Hive
>  Issue Type: New Feature
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Presently, Hive Iceberg supports major compaction via the Hive ACID syntax below.
> {code:java}
> ALTER TABLE name COMPACT MAJOR [AND WAIT] {code}
> Add support for OPTIMIZE TABLE syntax. Example:
> {code:java}
> OPTIMIZE TABLE name REWRITE DATA
>   future --- 
> [USING BIN_PACK]
> [ ( { FILE_SIZE_THRESHOLD | MIN_INPUT_FILES } =  [, ... ] ) ]
> WHERE category = 'c1' {code}
> This syntax will be in line with Impala's.
> Also, the OPTIMIZE command is not limited to compaction; it also supports other 
> table maintenance operations.
>  





[jira] [Comment Edited] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-23 Thread yongzhi.shao (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848817#comment-17848817
 ] 

yongzhi.shao edited comment on HIVE-28277 at 5/24/24 2:50 AM:
--

I've updated the code and the problem has gone away. Thank you, sir.

This problem was fixed in HIVE-28069.


was (Author: lisoda):
I've updated the code and the problem has gone away. Thank you, sir.

> HIVE does not support update operations for ICEBERG of type 
> location_based_table.
> -
>
> Key: HIVE-28277
> URL: https://issues.apache.org/jira/browse/HIVE-28277
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
> Environment: ICEBERG:1.5.2
> HIVE 4.0.0
>Reporter: yongzhi.shao
>Priority: Major
> Fix For: 4.0.0
>
>
> Currently, when I update the location_based_table using Hive, Hive 
> incorrectly empties all data and metadata directories.
> After the update statement is executed, the Iceberg table is corrupted.
>  
> {code:java}
> --spark 3.4.1 + iceberg 1.5.2:
> CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
> id string,name string
> )
> using iceberg
> PARTITIONED BY (name)
> TBLPROPERTIES 
> ('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');
> insert into datacenter.default.test_data_04(id,name) 
> values('1','a'),('2','b');
> --hive4:
> CREATE EXTERNAL TABLE default.test_data_04
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
> LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
> TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');
> select id,name from default.test_data_04; --2 row
> update test_data_04 set name = 'adasd' where id = '1';
> ERROR:
> 2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
> hive.HiveIcebergStorageHandler: Error while trying to commit job: 
> job_17061635207991_169536, job_17061635207990_169536, 
> job_17061635207992_169536, starting rollback changes for table: 
> default.test_data_04
> org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
> location: /iceberg-catalog/warehouse/default/test_data_04
> BEFORE UPDATE:
> ICEBERG TABLE DIR:
> [root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 2 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/data
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/metadata
> AFTER UPDATE:
> ICEBERG TABLE DIR:
> [root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 3 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3
> {code}
>  
>  





[jira] [Updated] (HIVE-27498) Support custom delimiter in SkippingTextInputFormat

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-27498:
--
Labels: pull-request-available  (was: )

> Support custom delimiter in SkippingTextInputFormat
> ---
>
> Key: HIVE-27498
> URL: https://issues.apache.org/jira/browse/HIVE-27498
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Taraka Rama Rao Lethavadla
>Assignee: Mayank Kunwar
>Priority: Major
>  Labels: pull-request-available
>
> A simple SELECT returns results as expected when these configs are set:
> {noformat}
> 'skip.header.line.count'='1',                    
> 'textinputformat.record.delimiter'='|'{noformat}
> but if we execute select count(*), or any query that launches a Tez job, the 
> whole text is treated as a single line.
> *Test case*
> data.csv
> {noformat}
> CodeName|A |B 
> |C  {noformat}
> DDL
> {noformat}
> create external table test(code string,name string)
> ROW FORMAT SERDE
>'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>  WITH SERDEPROPERTIES (
>'field.delim'='\t')
>  STORED AS INPUTFORMAT
>'org.apache.hadoop.mapred.TextInputFormat'
>  OUTPUTFORMAT
>'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>location '${system:test.tmp.dir}/test'
>  TBLPROPERTIES (
>'skip.header.line.count'='1',
>'textinputformat.record.delimiter'='|');{noformat}
> Query result
> select code,name from test;
> {noformat}
> A 
> B 
> 
> C {noformat}
> *Problem:* The query _+select count(*) from test+_ returns 1 instead of 3.
> It used to work in older Hive versions.
> The behaviour changed after the introduction of the feature in 
> https://issues.apache.org/jira/browse/HIVE-21924
> That feature splits text files while reading, even when the table is 
> configured to skip headers, thereby increasing the number of mappers 
> processing the query and improving its throughput.
> The actual problem lies in how the new feature reads a file: it does not 
> honour the 'textinputformat.record.delimiter' property and instead scans the 
> file for newline characters. Since the input file does not have a newline for 
> every record, the whole file is read as a single line and the count is 
> returned as 1.
> Ref: 
> [https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]
>  
>  *Workaround*
> If we remove the headers from the data and the skip-header config from the 
> table properties, or compress the files, the issue does not occur.
>  
>  
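The delimiter sensitivity described above can be reproduced outside Hive. The toy Java sketch below (hypothetical illustration, not Hive code) counts records in the sample data both ways: splitting on the configured '|' delimiter yields 3 records after skipping the header, while a newline-based scan of the same bytes yields 1, matching the miscount reported for select count(*).

```java
import java.util.regex.Pattern;

// Toy reproduction of the miscount (not Hive code): count records in data
// whose records are separated by a custom delimiter rather than newlines.
public class DelimiterCount {

    // Delimiter-aware count, mimicking textinputformat.record.delimiter='|'.
    static int countRecords(String data, String delim, int skipHeader) {
        String[] parts = data.split(Pattern.quote(delim), -1);
        return parts.length - skipHeader;
    }

    // Newline-based count, mimicking a reader that ignores the custom
    // delimiter and scans only for '\n'.
    static int countNewlineRecords(String data, int skipHeader) {
        return data.split("\n", -1).length - skipHeader;
    }

    public static void main(String[] args) {
        // Same shape as the data.csv in the test case: one header, three
        // records, and a single embedded newline.
        String data = "CodeName|A |B \n|C  ";
        System.out.println(countRecords(data, "|", 1));   // 3
        System.out.println(countNewlineRecords(data, 1)); // 1
    }
}
```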





[jira] [Updated] (HIVE-28279) Output the database name for SHOW EXTENDED TABLES statement

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28279:
--
Labels: pull-request-available  (was: )

> Output the database name for SHOW EXTENDED TABLES statement
> ---
>
> Key: HIVE-28279
> URL: https://issues.apache.org/jira/browse/HIVE-28279
> Project: Hive
>  Issue Type: Task
>  Components: Hive
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> HIVE-21301 introduced the {{SHOW EXTENDED TABLES}} statement, which outputs 
> the table name and table type while listing tables in a database.
> In this patch, we aim to add a new output field for the database name, for the 
> following reasons:
> 1. the database name in the {{SHOW EXTENDED TABLES}} statement is optional, so 
> including the database name in the output removes ambiguity in this case.
> 2. when using this statement to collect table names across a list of 
> databases, output that includes the database name is much more helpful.





[jira] [Updated] (HIVE-28278) Iceberg: Stats: IllegalStateException Invalid file: file length 0

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28278:
--
Labels: pull-request-available  (was: )

> Iceberg: Stats: IllegalStateException Invalid file: file length 0
> -
>
> Key: HIVE-28278
> URL: https://issues.apache.org/jira/browse/HIVE-28278
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
>Reporter: Denys Kuzmenko
>Assignee: Denys Kuzmenko
>Priority: Major
>  Labels: pull-request-available
>
> Bug fix: this can happen when the stats file was already created but the stats 
> object has not yet been written, and someone tries to read it.
> Why are the changes needed?
> {code}
> ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> java.lang.IllegalStateException: Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> {code}
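A minimal sketch of the kind of guard implied above (hypothetical, not the actual Hive patch): before parsing a stats file, check that it is at least as long as the 12-byte footer tail named in the error, and treat anything shorter as "stats not written yet" rather than a fatal error.

```java
// Hypothetical guard (not the actual Hive fix): a stats file that exists but
// is shorter than the minimal footer tail was created by a writer that has
// not finished yet, so readers should skip it instead of throwing
// IllegalStateException.
public class StatsFileGuard {

    // "minimal length of the footer tail 12" from the error message above.
    static final int MIN_FOOTER_TAIL = 12;

    // True only when the file is long enough to possibly contain a footer.
    static boolean isReadable(long fileLength) {
        return fileLength >= MIN_FOOTER_TAIL;
    }

    public static void main(String[] args) {
        System.out.println(isReadable(0));  // false: writer not finished
        System.out.println(isReadable(11)); // false: shorter than footer tail
        System.out.println(isReadable(12)); // true
    }
}
```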





[jira] [Created] (HIVE-28279) Output the database name for SHOW EXTENDED TABLES statement

2024-05-23 Thread Wechar (Jira)
Wechar created HIVE-28279:
-

 Summary: Output the database name for SHOW EXTENDED TABLES 
statement
 Key: HIVE-28279
 URL: https://issues.apache.org/jira/browse/HIVE-28279
 Project: Hive
  Issue Type: Task
  Components: Hive
Reporter: Wechar
Assignee: Wechar


HIVE-21301 introduced the {{SHOW EXTENDED TABLES}} statement, which outputs the 
table name and table type while listing tables in a database.

In this patch, we aim to add a new output field for the database name, for the 
following reasons:
1. the database name in the {{SHOW EXTENDED TABLES}} statement is optional, so 
including the database name in the output removes ambiguity in this case.
2. when using this statement to collect table names across a list of databases, 
output that includes the database name is much more helpful.
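As a sketch of the motivation, a session over several databases might look like this (the column layout shown is illustrative only, not a committed output format):

```sql
-- Today: listing tables per database loses the database context in the output.
SHOW EXTENDED TABLES IN db1;   -- tab_name | table_type
-- Proposed: include the database name so results gathered from many
-- databases can be combined unambiguously.
SHOW EXTENDED TABLES IN db1;   -- db_name | tab_name | table_type
```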





[jira] [Updated] (HIVE-28278) Iceberg: Stats: IllegalStateException Invalid file: file length 0

2024-05-23 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28278:
--
Issue Type: Bug  (was: Task)

> Iceberg: Stats: IllegalStateException Invalid file: file length 0
> -
>
> Key: HIVE-28278
> URL: https://issues.apache.org/jira/browse/HIVE-28278
> Project: Hive
>  Issue Type: Bug
>Reporter: Denys Kuzmenko
>Assignee: Denys Kuzmenko
>Priority: Major
>
> Bug fix: this can happen when the stats file was already created but the stats 
> object has not yet been written, and someone tries to read it.
> Why are the changes needed?
> {code}
> ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> java.lang.IllegalStateException: Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> {code}





[jira] [Updated] (HIVE-28278) Iceberg: Stats: IllegalStateException Invalid file: file length 0

2024-05-23 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28278:
--
Affects Version/s: 4.0.0

> Iceberg: Stats: IllegalStateException Invalid file: file length 0
> -
>
> Key: HIVE-28278
> URL: https://issues.apache.org/jira/browse/HIVE-28278
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Denys Kuzmenko
>Assignee: Denys Kuzmenko
>Priority: Major
>
> Bug fix: this can happen when the stats file was already created but the stats 
> object has not yet been written, and someone tries to read it.
> Why are the changes needed?
> {code}
> ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> java.lang.IllegalStateException: Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> {code}





[jira] [Updated] (HIVE-28278) Iceberg: Stats: IllegalStateException Invalid file: file length 0

2024-05-23 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28278:
--
Component/s: Iceberg integration

> Iceberg: Stats: IllegalStateException Invalid file: file length 0
> -
>
> Key: HIVE-28278
> URL: https://issues.apache.org/jira/browse/HIVE-28278
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
>Reporter: Denys Kuzmenko
>Assignee: Denys Kuzmenko
>Priority: Major
>
> Bug fix: this can happen when the stats file was already created but the stats 
> object has not yet been written, and someone tries to read it.
> Why are the changes needed?
> {code}
> ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> java.lang.IllegalStateException: Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> {code}





[jira] [Updated] (HIVE-28278) Iceberg: Stats: IllegalStateException Invalid file: file length 0

2024-05-23 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28278:
--
Status: Patch Available  (was: Open)

> Iceberg: Stats: IllegalStateException Invalid file: file length 0
> -
>
> Key: HIVE-28278
> URL: https://issues.apache.org/jira/browse/HIVE-28278
> Project: Hive
>  Issue Type: Task
>Reporter: Denys Kuzmenko
>Assignee: Denys Kuzmenko
>Priority: Major
>
> Bug fix: this can happen when the stats file was already created but the stats 
> object has not yet been written, and someone tries to read it.
> Why are the changes needed?
> {code}
> ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> java.lang.IllegalStateException: Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> {code}





[jira] [Assigned] (HIVE-28278) Iceberg: Stats: IllegalStateException Invalid file: file length 0

2024-05-23 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko reassigned HIVE-28278:
-

Assignee: Denys Kuzmenko

> Iceberg: Stats: IllegalStateException Invalid file: file length 0
> -
>
> Key: HIVE-28278
> URL: https://issues.apache.org/jira/browse/HIVE-28278
> Project: Hive
>  Issue Type: Task
>Reporter: Denys Kuzmenko
>Assignee: Denys Kuzmenko
>Priority: Major
>
> Bug fix: this can happen when the stats file was already created but the stats 
> object has not yet been written, and someone tries to read it.
> Why are the changes needed?
> {code}
> ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> java.lang.IllegalStateException: Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> {code}





[jira] [Created] (HIVE-28278) CDPD-70188

2024-05-23 Thread Denys Kuzmenko (Jira)
Denys Kuzmenko created HIVE-28278:
-

 Summary: CDPD-70188
 Key: HIVE-28278
 URL: https://issues.apache.org/jira/browse/HIVE-28278
 Project: Hive
  Issue Type: Task
Reporter: Denys Kuzmenko


Bug fix: this can happen when the stats file was already created but the stats 
object has not yet been written, and someone tries to read it.

Why are the changes needed?
{code}
ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
minimal length of the footer tail 12
java.lang.IllegalStateException: Invalid file: file length 0 is less than 
minimal length of the footer tail 12
{code}





[jira] [Updated] (HIVE-28278) Iceberg: Stats: IllegalStateException Invalid file: file length 0

2024-05-23 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28278:
--
Summary: Iceberg: Stats: IllegalStateException Invalid file: file length 0  
(was: CDPD-70188)

> Iceberg: Stats: IllegalStateException Invalid file: file length 0
> -
>
> Key: HIVE-28278
> URL: https://issues.apache.org/jira/browse/HIVE-28278
> Project: Hive
>  Issue Type: Task
>Reporter: Denys Kuzmenko
>Priority: Major
>
> Bug fix: this can happen when the stats file was already created but the stats 
> object has not yet been written, and someone tries to read it.
> Why are the changes needed?
> {code}
> ERROR : FAILED: IllegalStateException Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> java.lang.IllegalStateException: Invalid file: file length 0 is less than 
> minimal length of the footer tail 12
> {code}





[jira] [Reopened] (HIVE-27498) Support custom delimiter in SkippingTextInputFormat

2024-05-23 Thread Mayank Kunwar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Kunwar reopened HIVE-27498:
--

The issue is occurring again, so I am reopening the ticket.

> Support custom delimiter in SkippingTextInputFormat
> ---
>
> Key: HIVE-27498
> URL: https://issues.apache.org/jira/browse/HIVE-27498
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Taraka Rama Rao Lethavadla
>Priority: Major
>
> A simple SELECT returns results as expected when these configs are set:
> {noformat}
> 'skip.header.line.count'='1',                    
> 'textinputformat.record.delimiter'='|'{noformat}
> but if we execute select count(*), or any query that launches a Tez job, the 
> whole text is treated as a single line.
> *Test case*
> data.csv
> {noformat}
> CodeName|A |B 
> |C  {noformat}
> DDL
> {noformat}
> create external table test(code string,name string)
> ROW FORMAT SERDE
>'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>  WITH SERDEPROPERTIES (
>'field.delim'='\t')
>  STORED AS INPUTFORMAT
>'org.apache.hadoop.mapred.TextInputFormat'
>  OUTPUTFORMAT
>'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>location '${system:test.tmp.dir}/test'
>  TBLPROPERTIES (
>'skip.header.line.count'='1',
>'textinputformat.record.delimiter'='|');{noformat}
> Query result
> select code,name from test;
> {noformat}
> A 
> B 
> 
> C {noformat}
> *Problem:* The query _+select count(*) from test+_ returns 1 instead of 3.
> It used to work in older Hive versions.
> The behaviour changed after the introduction of the feature in 
> https://issues.apache.org/jira/browse/HIVE-21924
> That feature splits text files while reading, even when the table is 
> configured to skip headers, thereby increasing the number of mappers 
> processing the query and improving its throughput.
> The actual problem lies in how the new feature reads a file: it does not 
> honour the 'textinputformat.record.delimiter' property and instead scans the 
> file for newline characters. Since the input file does not have a newline for 
> every record, the whole file is read as a single line and the count is 
> returned as 1.
> Ref: 
> [https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]
>  
>  *Workaround*
> If we remove the headers from the data and the skip-header config from the 
> table properties, or compress the files, the issue does not occur.
>  
>  





[jira] [Assigned] (HIVE-27498) Support custom delimiter in SkippingTextInputFormat

2024-05-23 Thread Mayank Kunwar (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Kunwar reassigned HIVE-27498:


Assignee: Mayank Kunwar

> Support custom delimiter in SkippingTextInputFormat
> ---
>
> Key: HIVE-27498
> URL: https://issues.apache.org/jira/browse/HIVE-27498
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Taraka Rama Rao Lethavadla
>Assignee: Mayank Kunwar
>Priority: Major
>
> A simple SELECT returns results as expected when these configs are set:
> {noformat}
> 'skip.header.line.count'='1',                    
> 'textinputformat.record.delimiter'='|'{noformat}
> but if we execute select count(*), or any query that launches a Tez job, the 
> whole text is treated as a single line.
> *Test case*
> data.csv
> {noformat}
> CodeName|A |B 
> |C  {noformat}
> DDL
> {noformat}
> create external table test(code string,name string)
> ROW FORMAT SERDE
>'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>  WITH SERDEPROPERTIES (
>'field.delim'='\t')
>  STORED AS INPUTFORMAT
>'org.apache.hadoop.mapred.TextInputFormat'
>  OUTPUTFORMAT
>'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>location '${system:test.tmp.dir}/test'
>  TBLPROPERTIES (
>'skip.header.line.count'='1',
>'textinputformat.record.delimiter'='|');{noformat}
> Query result
> select code,name from test;
> {noformat}
> A 
> B 
> 
> C {noformat}
> *Problem:* The query _+select count(*) from test+_ returns 1 instead of 3.
> It used to work in older Hive versions.
> The behaviour changed after the introduction of the feature in 
> https://issues.apache.org/jira/browse/HIVE-21924
> That feature splits text files while reading, even when the table is 
> configured to skip headers, thereby increasing the number of mappers 
> processing the query and improving its throughput.
> The actual problem lies in how the new feature reads a file: it does not 
> honour the 'textinputformat.record.delimiter' property and instead scans the 
> file for newline characters. Since the input file does not have a newline for 
> every record, the whole file is read as a single line and the count is 
> returned as 1.
> Ref: 
> [https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]
>  
>  *Workaround*
> If we remove the headers from the data and the skip-header config from the 
> table properties, or compress the files, the issue does not occur.
>  
>  





[jira] [Resolved] (HIVE-28273) Test data generation failure in HIVE-28249 related tests

2024-05-23 Thread Stamatis Zampetakis (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis resolved HIVE-28273.

Fix Version/s: 4.1.0
   Resolution: Fixed

Fixed in 
https://github.com/apache/hive/commit/019017d0909a17d6e85d519f5c3f4f52828fd509

Thanks for the PR [~Csaba]!

> Test data generation failure in HIVE-28249 related tests
> 
>
> Key: HIVE-28273
> URL: https://issues.apache.org/jira/browse/HIVE-28273
> Project: Hive
>  Issue Type: Bug
>Reporter: Csaba Juhász
>Assignee: Csaba Juhász
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
> Attachments: image-2024-05-22-19-11-35-890.png
>
>
> generateJulianLeapYearTimestamps and generateJulianLeapYearTimestamps28thFeb 
> throw NegativeArraySizeException once the base value is 999 or greater.
> This is caused by the code below, which supplies a negative value (when digits 
> returns a value larger than 4) to zeros, which in turn is used to create a new 
> char array.
> {code:java}
> StringBuilder sb = new StringBuilder(29);
> int year = ((i % ) + 1) * 100;
> sb.append(zeros(4 - digits(year)));
> {code}
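The crash can be sketched in isolation (digits and zeros below are hypothetical stand-ins for the test helpers, not the real Hive code): for a year with more than 4 digits, 4 - digits(year) is negative, and new char[negative] throws NegativeArraySizeException; clamping the pad width at zero avoids the crash.

```java
import java.util.Arrays;

// Isolated sketch of the failure mode (digits/zeros are stand-ins for the
// test helpers, not the real Hive code).
public class ZeroPad {

    static int digits(int v) {
        return String.valueOf(v).length();
    }

    // new char[n] throws NegativeArraySizeException when n < 0.
    static String zeros(int n) {
        char[] c = new char[n];
        Arrays.fill(c, '0');
        return new String(c);
    }

    // Clamped variant: never requests a negative array size.
    static String zerosClamped(int n) {
        return zeros(Math.max(0, n));
    }

    public static void main(String[] args) {
        int bigYear = 100000; // 6 digits, so 4 - digits(bigYear) is negative
        System.out.println(zerosClamped(4 - digits(bigYear)).length()); // 0
        System.out.println(zerosClamped(4 - digits(99)));               // 00
    }
}
```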
> When the tests are run using Maven, the error in the generation function is 
> caught but never rethrown or reported, and the build is reported as successful. 
> For example, running
> _TestParquetTimestampsHive2Compatibility#testWriteHive2ReadHive4UsingLegacyConversionWithJulianLeapYearsFor28thFeb_
>  has the result:
> {code:java}
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 0.723 s - in 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0
> ...
> [INFO] BUILD SUCCESS
> {code}
> When the test is run through an IDE (eg VSCode), the failure is reported 
> properly.
>  !image-2024-05-22-19-11-35-890.png! 





[jira] [Resolved] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yongzhi.shao resolved HIVE-28277.
-
Fix Version/s: 4.0.0
   Resolution: Won't Fix

> HIVE does not support update operations for ICEBERG of type 
> location_based_table.
> -
>
> Key: HIVE-28277
> URL: https://issues.apache.org/jira/browse/HIVE-28277
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
> Environment: ICEBERG:1.5.2
> HIVE 4.0.0
>Reporter: yongzhi.shao
>Priority: Major
> Fix For: 4.0.0
>
>
> Currently, when I update the location_based_table using hive, hive 
> incorrectly empties all data directories and metadata directories.
> After the update statement is executed, the iceberg table is corrupted.
>  
> {code:java}
> --spark 3.4.1 + iceberg 1.5.2:
> CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
> id string,name string
> )
> using iceberg
> PARTITIONED BY (name)
> TBLPROPERTIES 
> ('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');
> insert into datacenter.default.test_data_04(id,name) 
> values('1','a'),('2','b');
> --hive4:
> CREATE EXTERNAL TABLE default.test_data_04
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
> LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
> TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');
> select id,name from default.test_data_04; --2 row
> update test_data_04 set name = 'adasd' where id = '1';
> ERROR:
> 2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
> hive.HiveIcebergStorageHandler: Error while trying to commit job: 
> job_17061635207991_169536, job_17061635207990_169536, 
> job_17061635207992_169536, starting rollback changes for table: 
> default.test_data_04
> org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
> location: /iceberg-catalog/warehouse/default/test_data_04
> BEFORE UPDATE:
> ICEBERG TABLE DIR:
> [root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 2 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/data
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/metadata
> AFTER UPDATE:
> ICEBERG TABLE DIR:
> [root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 3 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3
> {code}
>  
>  





[jira] [Commented] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848817#comment-17848817
 ] 

yongzhi.shao commented on HIVE-28277:
-

I updated the code and the problem has indeed gone away. Thanks.

> HIVE does not support update operations for ICEBERG of type 
> location_based_table.
> -
>
> Key: HIVE-28277
> URL: https://issues.apache.org/jira/browse/HIVE-28277
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
> Environment: ICEBERG:1.5.2
> HIVE 4.0.0
>Reporter: yongzhi.shao
>Priority: Major
>
> Currently, when I update the location_based_table using hive, hive 
> incorrectly empties all data directories and metadata directories.
> After the update statement is executed, the iceberg table is corrupted.
>  
> {code:java}
> --spark 3.4.1 + iceberg 1.5.2:
> CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
> id string,name string
> )
> using iceberg
> PARTITIONED BY (name)
> TBLPROPERTIES 
> ('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');
> insert into datacenter.default.test_data_04(id,name) 
> values('1','a'),('2','b');
> --hive4:
> CREATE EXTERNAL TABLE default.test_data_04
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
> LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
> TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');
> select id,name from default.test_data_04; --2 row
> update test_data_04 set name = 'adasd' where id = '1';
> ERROR:
> 2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
> hive.HiveIcebergStorageHandler: Error while trying to commit job: 
> job_17061635207991_169536, job_17061635207990_169536, 
> job_17061635207992_169536, starting rollback changes for table: 
> default.test_data_04
> org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
> location: /iceberg-catalog/warehouse/default/test_data_04
> BEFORE UPDATE:
> ICEBERG TABLE DIR:
> [root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 2 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/data
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/metadata
> AFTER UPDATE:
> ICEBERG TABLE DIR:
> [root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 3 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3
> {code}
>  
>  





[jira] [Comment Edited] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848817#comment-17848817
 ] 

yongzhi.shao edited comment on HIVE-28277 at 5/23/24 5:15 AM:
--

I've updated the code and the problem has gone away. Thank you, sir.


was (Author: lisoda):
I updated the code and the problem has indeed gone away. Thanks.

> HIVE does not support update operations for ICEBERG of type 
> location_based_table.
> -
>
> Key: HIVE-28277
> URL: https://issues.apache.org/jira/browse/HIVE-28277
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
> Environment: ICEBERG:1.5.2
> HIVE 4.0.0
>Reporter: yongzhi.shao
>Priority: Major
>
> Currently, when I update the location_based_table using hive, hive 
> incorrectly empties all data directories and metadata directories.
> After the update statement is executed, the iceberg table is corrupted.
>  
> {code:java}
> --spark 3.4.1 + iceberg 1.5.2:
> CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
> id string,name string
> )
> using iceberg
> PARTITIONED BY (name)
> TBLPROPERTIES 
> ('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');
> insert into datacenter.default.test_data_04(id,name) 
> values('1','a'),('2','b');
> --hive4:
> CREATE EXTERNAL TABLE default.test_data_04
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
> LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
> TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');
> select id,name from default.test_data_04; --2 row
> update test_data_04 set name = 'adasd' where id = '1';
> ERROR:
> 2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
> hive.HiveIcebergStorageHandler: Error while trying to commit job: 
> job_17061635207991_169536, job_17061635207990_169536, 
> job_17061635207992_169536, starting rollback changes for table: 
> default.test_data_04
> org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
> location: /iceberg-catalog/warehouse/default/test_data_04
> BEFORE UPDATE:
> ICEBERG TABLE DIR:
> [root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 2 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/data
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/metadata
> AFTER UPDATE:
> ICEBERG TABLE DIR:
> [root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 3 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3
> {code}
>  
>  





[jira] [Commented] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread Butao Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848814#comment-17848814
 ] 

Butao Zhang commented on HIVE-28277:


I couldn't reproduce this issue on hive4/master; maybe it is some other environment problem...

> HIVE does not support update operations for ICEBERG of type 
> location_based_table.
> -
>
> Key: HIVE-28277
> URL: https://issues.apache.org/jira/browse/HIVE-28277
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
> Environment: ICEBERG:1.5.2
> HIVE 4.0.0
>Reporter: yongzhi.shao
>Priority: Major
>
> Currently, when I update the location_based_table using hive, hive 
> incorrectly empties all data directories and metadata directories.
> After the update statement is executed, the iceberg table is corrupted.
>  
> {code:java}
> --spark 3.4.1 + iceberg 1.5.2:
> CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
> id string,name string
> )
> using iceberg
> PARTITIONED BY (name)
> TBLPROPERTIES 
> ('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');
> insert into datacenter.default.test_data_04(id,name) 
> values('1','a'),('2','b');
> --hive4:
> CREATE EXTERNAL TABLE default.test_data_04
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
> LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
> TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');
> select id,name from default.test_data_04; --2 row
> update test_data_04 set name = 'adasd' where id = '1';
> ERROR:
> 2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
> hive.HiveIcebergStorageHandler: Error while trying to commit job: 
> job_17061635207991_169536, job_17061635207990_169536, 
> job_17061635207992_169536, starting rollback changes for table: 
> default.test_data_04
> org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
> location: /iceberg-catalog/warehouse/default/test_data_04
> BEFORE UPDATE:
> ICEBERG TABLE DIR:
> [root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 2 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/data
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/metadata
> AFTER UPDATE:
> ICEBERG TABLE DIR:
> [root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 3 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3
> {code}
>  
>  





[jira] [Updated] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yongzhi.shao updated HIVE-28277:

Description: 
Currently, when I update a location_based_table using Hive, Hive incorrectly 
empties all data and metadata directories.

After the update statement is executed, the Iceberg table is corrupted.

 
{code:java}
--spark 3.4.1 + iceberg 1.5.2:
CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
id string,name string
)
using iceberg
PARTITIONED BY (name)
TBLPROPERTIES 
('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');

insert into datacenter.default.test_data_04(id,name) values('1','a'),('2','b');

--hive4:
CREATE EXTERNAL TABLE default.test_data_04
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
TBLPROPERTIES 
('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

select id,name from default.test_data_04; --2 row

update test_data_04 set name = 'adasd' where id = '1';

ERROR:
2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
hive.HiveIcebergStorageHandler: Error while trying to commit job: 
job_17061635207991_169536, job_17061635207990_169536, 
job_17061635207992_169536, starting rollback changes for table: 
default.test_data_04
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
location: /iceberg-catalog/warehouse/default/test_data_04


BEFORE UPDATE:
ICEBERG TABLE DIR:
[root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 2 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/data
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/metadata


AFTER UPDATE:
ICEBERG TABLE DIR:

[root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 3 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3


{code}
 

 

  was:
Currently, when I update the location_based_table using hive, hive incorrectly 
empties all data directories and metadata directories.

After the update statement is executed, the iceberg table is corrupted.

 
{code:java}
--spark 3.4.1 + iceberg 1.5.2:
CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
id string,name string
)
using iceberg
PARTITIONED BY (name)
TBLPROPERTIES 
('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');

insert into datacenter.default.test_data_04(id,name) values('1','a'),('2','b');

--hive4:
CREATE EXTERNAL TABLE default.test_data_04
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
TBLPROPERTIES 
('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

select distinct id,name from (select id,name from default.test_data_04 limit 
10) s1; --2 row

update test_data_04 set name = 'adasd' where id = '1';

ERROR:
2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
hive.HiveIcebergStorageHandler: Error while trying to commit job: 
job_17061635207991_169536, job_17061635207990_169536, 
job_17061635207992_169536, starting rollback changes for table: 
default.test_data_04
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
location: /iceberg-catalog/warehouse/default/test_data_04


BEFORE UPDATE:
ICEBERG TABLE DIR:
[root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 2 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/data
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/metadata


AFTER UPDATE:
ICEBERG TABLE DIR:

[root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 3 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04

[jira] [Updated] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yongzhi.shao updated HIVE-28277:

Issue Type: Bug  (was: Improvement)

> HIVE does not support update operations for ICEBERG of type 
> location_based_table.
> -
>
> Key: HIVE-28277
> URL: https://issues.apache.org/jira/browse/HIVE-28277
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0
> Environment: ICEBERG:1.5.2
> HIVE 4.0.0
>Reporter: yongzhi.shao
>Priority: Major
>
> Currently, when I update the location_based_table using hive, hive 
> incorrectly empties all data directories and metadata directories.
> After the update statement is executed, the iceberg table is corrupted.
>  
> {code:java}
> --spark 3.4.1 + iceberg 1.5.2:
> CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
> id string,name string
> )
> using iceberg
> PARTITIONED BY (name)
> TBLPROPERTIES 
> ('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');
> insert into datacenter.default.test_data_04(id,name) 
> values('1','a'),('2','b');
> --hive4:
> CREATE EXTERNAL TABLE default.test_data_04
> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
> LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
> TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');
> select distinct id,name from (select id,name from default.test_data_04 limit 
> 10) s1; --2 row
> update test_data_04 set name = 'adasd' where id = '1';
> ERROR:
> 2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
> hive.HiveIcebergStorageHandler: Error while trying to commit job: 
> job_17061635207991_169536, job_17061635207990_169536, 
> job_17061635207992_169536, starting rollback changes for table: 
> default.test_data_04
> org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
> location: /iceberg-catalog/warehouse/default/test_data_04
> BEFORE UPDATE:
> ICEBERG TABLE DIR:
> [root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 2 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/data
> drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
> /iceberg-catalog/warehouse/default/test_data_04/metadata
> AFTER UPDATE:
> ICEBERG TABLE DIR:
> [root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
> Found 3 items
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
> drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
> /iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3
> {code}
>  
>  





[jira] [Updated] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yongzhi.shao updated HIVE-28277:

Description: 
Currently, when I update a location_based_table using Hive, Hive incorrectly 
empties all data and metadata directories.

After the update statement is executed, the Iceberg table is corrupted.

 
{code:java}
--spark 3.4.1 + iceberg 1.5.2:
CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
id string,name string
)
using iceberg
PARTITIONED BY (name)
TBLPROPERTIES 
('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');

insert into datacenter.default.test_data_04(id,name) values('1','a'),('2','b');

--hive4:
CREATE EXTERNAL TABLE default.test_data_04
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
TBLPROPERTIES 
('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

select distinct id,name from (select id,name from default.test_data_04 limit 
10) s1; --2 row

update test_data_04 set name = 'adasd' where id = '1';

ERROR:
2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
hive.HiveIcebergStorageHandler: Error while trying to commit job: 
job_17061635207991_169536, job_17061635207990_169536, 
job_17061635207992_169536, starting rollback changes for table: 
default.test_data_04
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
location: /iceberg-catalog/warehouse/default/test_data_04


BEFORE UPDATE:
ICEBERG TABLE DIR:
[root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 2 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/data
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/metadata


AFTER UPDATE:
ICEBERG TABLE DIR:

[root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 3 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3


{code}
 

 

  was:
Currently, when I update the location_based_table using hive, hive incorrectly 
empties all data directories and metadata directories.

After the update statement is executed, the iceberg table is corrupted.

 
{code:java}
--spark:
CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
id string,name string
)
using iceberg
PARTITIONED BY (name)
TBLPROPERTIES 
('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');

insert into datacenter.default.test_data_04(id,name) values('1','a'),('2','b');

--hive4:
CREATE EXTERNAL TABLE default.test_data_04
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
TBLPROPERTIES 
('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

select distinct id,name from (select id,name from default.test_data_04 limit 
10) s1; --2 row

update test_data_04 set name = 'adasd' where id = '1';

ERROR:
2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
hive.HiveIcebergStorageHandler: Error while trying to commit job: 
job_17061635207991_169536, job_17061635207990_169536, 
job_17061635207992_169536, starting rollback changes for table: 
default.test_data_04
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
location: /iceberg-catalog/warehouse/default/test_data_04


BEFORE UPDATE:
ICEBERG TABLE DIR:
[root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 2 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/data
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/metadata


AFTER UPDATE:
ICEBERG TABLE DIR:

[root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 3 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse

[jira] [Updated] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yongzhi.shao updated HIVE-28277:

Description: 
Currently, when I update a location_based_table using Hive, Hive incorrectly 
empties all data and metadata directories.

After the update statement is executed, the Iceberg table is corrupted.

 
{code:java}
--spark:
CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
id string,name string
)
using iceberg
PARTITIONED BY (name)
TBLPROPERTIES 
('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');

insert into datacenter.default.test_data_04(id,name) values('1','a'),('2','b');

--hive4:
CREATE EXTERNAL TABLE default.test_data_04
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
TBLPROPERTIES 
('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

select distinct id,name from (select id,name from default.test_data_04 limit 
10) s1; --2 row

update test_data_04 set name = 'adasd' where id = '1';

ERROR:
2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
hive.HiveIcebergStorageHandler: Error while trying to commit job: 
job_17061635207991_169536, job_17061635207990_169536, 
job_17061635207992_169536, starting rollback changes for table: 
default.test_data_04
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
location: /iceberg-catalog/warehouse/default/test_data_04


BEFORE UPDATE:
ICEBERG TABLE DIR:
[root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 2 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/data
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/metadata


AFTER UPDATE:
ICEBERG TABLE DIR:

[root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 3 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3


{code}
 

 

  was:
Currently, when I update the location_based_table using hive, hive incorrectly 
empties all data directories and metadata directories.

 

 
{code:java}
--spark:
CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
id string,name string
)
using iceberg
PARTITIONED BY (name)
TBLPROPERTIES 
('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');

insert into datacenter.default.test_data_04(id,name) values('1','a'),('2','b');

--hive4:
CREATE EXTERNAL TABLE default.test_data_04
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
TBLPROPERTIES 
('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

select distinct id,name from (select id,name from default.test_data_04 limit 
10) s1; --2 row

update test_data_04 set name = 'adasd' where id = '1';

ERROR:
2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
hive.HiveIcebergStorageHandler: Error while trying to commit job: 
job_17061635207991_169536, job_17061635207990_169536, 
job_17061635207992_169536, starting rollback changes for table: 
default.test_data_04
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
location: /iceberg-catalog/warehouse/default/test_data_04


BEFORE UPDATE:
ICEBERG TABLE DIR:
[root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 2 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/data
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/metadata


AFTER UPDATE:
ICEBERG TABLE DIR:

[root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 3 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3


{code}
 

 


> HIVE does not support upd

[jira] [Created] (HIVE-28277) HIVE does not support update operations for ICEBERG of type location_based_table.

2024-05-22 Thread yongzhi.shao (Jira)
yongzhi.shao created HIVE-28277:
---

 Summary: HIVE does not support update operations for ICEBERG of 
type location_based_table.
 Key: HIVE-28277
 URL: https://issues.apache.org/jira/browse/HIVE-28277
 Project: Hive
  Issue Type: Improvement
  Components: Iceberg integration
Affects Versions: 4.0.0
 Environment: ICEBERG:1.5.2

HIVE 4.0.0
Reporter: yongzhi.shao


Currently, when I update a location_based_table using Hive, Hive incorrectly 
empties all data and metadata directories.

 

 
{code:java}
--spark:
CREATE TABLE IF NOT EXISTS datacenter.default.test_data_04 (
id string,name string
)
using iceberg
PARTITIONED BY (name)
TBLPROPERTIES 
('read.orc.vectorization.enabled'='true','write.format.default'='orc','write.orc.bloom.filter.columns'='id','write.orc.compression-codec'='zstd','write.metadata.previous-versions-max'='3','write.metadata.delete-after-commit.enabled'='true');

insert into datacenter.default.test_data_04(id,name) values('1','a'),('2','b');

--hive4:
CREATE EXTERNAL TABLE default.test_data_04
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
LOCATION 'hdfs:///iceberg-catalog/warehouse/default/test_data_04'
TBLPROPERTIES 
('iceberg.catalog'='location_based_table','engine.hive.enabled'='true');

select distinct id,name from (select id,name from default.test_data_04 limit 
10) s1; -- 2 rows

update test_data_04 set name = 'adasd' where id = '1';

ERROR:
2024-05-23T10:26:32,028 ERROR [HiveServer2-Background-Pool: Thread-297] 
hive.HiveIcebergStorageHandler: Error while trying to commit job: 
job_17061635207991_169536, job_17061635207990_169536, 
job_17061635207992_169536, starting rollback changes for table: 
default.test_data_04
org.apache.iceberg.exceptions.NoSuchTableException: Table does not exist at 
location: /iceberg-catalog/warehouse/default/test_data_04


BEFORE UPDATE:
ICEBERG TABLE DIR:
[root@ ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 2 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/data
drwxr-xr-x   - hive hdfs          0 2024-05-23 09:26 
/iceberg-catalog/warehouse/default/test_data_04/metadata


AFTER UPDATE:
ICEBERG TABLE DIR:

[root@XXX ~]# hdfs dfs -ls /iceberg-catalog/warehouse/default/test_data_04
Found 3 items
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_1
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_2
drwxr-xr-x   - hive hdfs          0 2024-05-23 10:26 
/iceberg-catalog/warehouse/default/test_data_04/-tmp.HIVE_UNION_SUBDIR_3


{code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28276) Iceberg: Make Iceberg split threads configurable when table scanning

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28276:
--
Labels: pull-request-available  (was: )

> Iceberg: Make Iceberg split threads configurable when table scanning
> 
>
> Key: HIVE-28276
> URL: https://issues.apache.org/jira/browse/HIVE-28276
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28276) Iceberg: Make Iceberg split threads configurable when table scanning

2024-05-22 Thread Butao Zhang (Jira)
Butao Zhang created HIVE-28276:
--

 Summary: Iceberg: Make Iceberg split threads configurable when 
table scanning
 Key: HIVE-28276
 URL: https://issues.apache.org/jira/browse/HIVE-28276
 Project: Hive
  Issue Type: Improvement
  Components: Iceberg integration
Reporter: Butao Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-28276) Iceberg: Make Iceberg split threads configurable when table scanning

2024-05-22 Thread Butao Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Butao Zhang reassigned HIVE-28276:
--

Assignee: Butao Zhang

> Iceberg: Make Iceberg split threads configurable when table scanning
> 
>
> Key: HIVE-28276
> URL: https://issues.apache.org/jira/browse/HIVE-28276
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-25351) stddev(), stddev_pop() with CBO enable returning null

2024-05-22 Thread Jiandan Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  reassigned HIVE-25351:


Assignee: Jiandan Yang   (was: Dayakar M)

> stddev(), stddev_pop() with CBO enable returning null
> -
>
> Key: HIVE-25351
> URL: https://issues.apache.org/jira/browse/HIVE-25351
> Project: Hive
>  Issue Type: Bug
>Reporter: Ashish Sharma
>Assignee: Jiandan Yang 
>Priority: Blocker
>  Labels: pull-request-available
>
> *script used to repro*
> create table cbo_test (key string, v1 double, v2 decimal(30,2), v3 
> decimal(30,2));
> insert into cbo_test values ("00140006375905", 10230.72, 
> 10230.72, 10230.69), ("00140006375905", 10230.72, 10230.72, 
> 10230.69), ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69);
> select stddev(v1), stddev(v2), stddev(v3) from cbo_test;
> *Enable CBO*
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)|
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2 vectorized |
> |   File Output Operator [FS_13] |
> | Select Operator [SEL_12] (rows=1 width=24) |
> |   Output:["_col0","_col1","_col2"] |
> |   Group By Operator [GBY_11] (rows=1 width=72) |
> | 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(VALUE._col0)","sum(VALUE._col1)","count(VALUE._col2)","sum(VALUE._col3)","sum(VALUE._col4)","count(VALUE._col5)","sum(VALUE._col6)","sum(VALUE._col7)","count(VALUE._col8)"]
>  |
> |   <-Map 1 [CUSTOM_SIMPLE_EDGE] vectorized  |
> | PARTITION_ONLY_SHUFFLE [RS_10] |
> |   Group By Operator [GBY_9] (rows=1 width=72) |
> | 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(_col3)","sum(_col0)","count(_col0)","sum(_col5)","sum(_col4)","count(_col1)","sum(_col7)","sum(_col6)","count(_col2)"]
>  |
> | Select Operator [SEL_8] (rows=6 width=232) |
> |   
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"] |
> |   TableScan [TS_0] (rows=6 width=232) |
> | default@cbo_test,cbo_test, ACID 
> table,Tbl:COMPLETE,Col:COMPLETE,Output:["v1","v2","v3"] |
> ||
> ++
> *Query Result* 
> _c0   _c1 _c2
> 0.0   NaN NaN
> *Disable CBO*
> ++
> |  Explain   |
> ++
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)|
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2 vectorized |
> |   File Output Operator [FS_11]
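The NaN result above is characteristic of the single-pass sum-of-squares variance formula: mathematically the variance is non-negative, but floating-point cancellation can make the computed value slightly negative, and the square root of a negative double is NaN. A minimal Java illustration (not Hive's actual aggregation code; method names here are hypothetical):

```java
import java.util.Arrays;

public class StddevCancellation {
    // Single-pass formula: var = (sum(x^2) - sum(x)^2 / n) / n.
    // Mathematically >= 0, but rounding can push it slightly below zero,
    // and Math.sqrt of a negative double returns NaN.
    static double naiveVariance(double[] xs) {
        double sum = 0, sumSq = 0;
        for (double x : xs) { sum += x; sumSq += x * x; }
        int n = xs.length;
        return (sumSq - sum * sum / n) / n;
    }

    // Two-pass formula: subtract the mean first. The accumulator is a
    // sum of squared differences, so it can never be negative.
    static double twoPassVariance(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double acc = 0;
        for (double x : xs) acc += (x - mean) * (x - mean);
        return acc / xs.length;
    }

    public static void main(String[] args) {
        double[] xs = new double[6];
        Arrays.fill(xs, 10230.72);  // the repeated value from the repro above
        System.out.println(Math.sqrt(naiveVariance(xs)));   // may print NaN
        System.out.println(Math.sqrt(twoPassVariance(xs))); // tiny value or 0.0
    }
}
```

The two-pass (or Welford) form is the standard remedy when identical or near-identical inputs drive the naive formula into negative territory.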

[jira] [Updated] (HIVE-28274) Iceberg: Add support for 'If Not Exists' and 'or Replace' for Create Branch

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28274:
--
Labels: pull-request-available  (was: )

> Iceberg: Add support for 'If Not Exists' and 'or Replace' for Create Branch
> ---
>
> Key: HIVE-28274
> URL: https://issues.apache.org/jira/browse/HIVE-28274
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>
> Add support for 
> {noformat}
> -- CREATE audit-branch at current snapshot with default retention if it 
> doesn't exist.
> ALTER TABLE prod.db.sample CREATE BRANCH IF NOT EXISTS `audit-branch`
> -- CREATE audit-branch at current snapshot with default retention or REPLACE 
> it if it already exists.
> ALTER TABLE prod.db.sample CREATE OR REPLACE BRANCH `audit-branch`{noformat}
> Like Spark:
> https://iceberg.apache.org/docs/1.5.1/spark-ddl/#branching-and-tagging-ddl



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28274) Iceberg: Add support for 'If Not Exists' and 'or Replace' for Create Branch

2024-05-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HIVE-28274:

Summary: Iceberg: Add support for 'If Not Exists' and 'or Replace' for 
Create Branch  (was: Iceberg: Add support for 'If Not Exists" and 'or Replace' 
for Create Branch)

> Iceberg: Add support for 'If Not Exists' and 'or Replace' for Create Branch
> ---
>
> Key: HIVE-28274
> URL: https://issues.apache.org/jira/browse/HIVE-28274
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>
> Add support for 
> {noformat}
> -- CREATE audit-branch at current snapshot with default retention if it 
> doesn't exist.
> ALTER TABLE prod.db.sample CREATE BRANCH IF NOT EXISTS `audit-branch`
> -- CREATE audit-branch at current snapshot with default retention or REPLACE 
> it if it already exists.
> ALTER TABLE prod.db.sample CREATE OR REPLACE BRANCH `audit-branch`{noformat}
> Like Spark:
> https://iceberg.apache.org/docs/1.5.1/spark-ddl/#branching-and-tagging-ddl



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28275) Iceberg: Add support for 'If Not Exists" and 'or Replace' for Create Tag

2024-05-22 Thread Ayush Saxena (Jira)
Ayush Saxena created HIVE-28275:
---

 Summary: Iceberg: Add support for 'If Not Exists" and 'or Replace' 
for Create Tag 
 Key: HIVE-28275
 URL: https://issues.apache.org/jira/browse/HIVE-28275
 Project: Hive
  Issue Type: Sub-task
Reporter: Ayush Saxena
Assignee: Ayush Saxena


Add support for IF NOT EXISTS and OR REPLACE when creating tags.
{noformat}
-- CREATE historical-tag at current snapshot with default retention if it 
doesn't exist.
ALTER TABLE prod.db.sample CREATE TAG IF NOT EXISTS `historical-tag`

-- CREATE historical-tag at current snapshot with default retention or REPLACE 
it if it already exists.
ALTER TABLE prod.db.sample CREATE OR REPLACE TAG `historical-tag`{noformat}
Like Spark:

https://iceberg.apache.org/docs/1.5.1/spark-ddl/#alter-table-create-branch



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28274) Iceberg: Add support for 'If Not Exists" and 'or Replace' for Create Branch

2024-05-22 Thread Ayush Saxena (Jira)
Ayush Saxena created HIVE-28274:
---

 Summary: Iceberg: Add support for 'If Not Exists" and 'or Replace' 
for Create Branch
 Key: HIVE-28274
 URL: https://issues.apache.org/jira/browse/HIVE-28274
 Project: Hive
  Issue Type: Sub-task
Reporter: Ayush Saxena
Assignee: Ayush Saxena


Add support for 
{noformat}
-- CREATE audit-branch at current snapshot with default retention if it doesn't 
exist.
ALTER TABLE prod.db.sample CREATE BRANCH IF NOT EXISTS `audit-branch`

-- CREATE audit-branch at current snapshot with default retention or REPLACE it 
if it already exists.
ALTER TABLE prod.db.sample CREATE OR REPLACE BRANCH `audit-branch`{noformat}
Like Spark:

https://iceberg.apache.org/docs/1.5.1/spark-ddl/#branching-and-tagging-ddl



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work started] (HIVE-28273) Test data generation failure in HIVE-28249 related tests

2024-05-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-28273 started by Csaba Juhász.
---
> Test data generation failure in HIVE-28249 related tests
> 
>
> Key: HIVE-28273
> URL: https://issues.apache.org/jira/browse/HIVE-28273
> Project: Hive
>  Issue Type: Bug
>Reporter: Csaba Juhász
>Assignee: Csaba Juhász
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-22-19-11-35-890.png
>
>
> generateJulianLeapYearTimestamps and generateJulianLeapYearTimestamps28thFeb 
> throw a NegativeArraySizeException once the base value is 999 or greater.
> This is caused by the code below, which supplies a negative value (when digits 
> returns a value larger than 4) to zeros, which in turn is used as the size of 
> a new char array.
> {code:java}
> StringBuilder sb = new StringBuilder(29);
> int year = ((i % ) + 1) * 100;
> sb.append(zeros(4 - digits(year)));
> {code}
> When the tests are run using Maven, the error in the generation function is 
> caught but never rethrown or reported, and the build is reported as 
> successful. For example, running
> _TestParquetTimestampsHive2Compatibility#testWriteHive2ReadHive4UsingLegacyConversionWithJulianLeapYearsFor28thFeb_
>  has the result:
> {code:java}
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 0.723 s - in 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0
> ...
> [INFO] BUILD SUCCESS
> {code}
> When the test is run through an IDE (eg VSCode), the failure is reported 
> properly.
>  !image-2024-05-22-19-11-35-890.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-28273) Test data generation failure in HIVE-28249 related tests

2024-05-22 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Juhász reassigned HIVE-28273:
---

Assignee: Csaba Juhász

> Test data generation failure in HIVE-28249 related tests
> 
>
> Key: HIVE-28273
> URL: https://issues.apache.org/jira/browse/HIVE-28273
> Project: Hive
>  Issue Type: Bug
>Reporter: Csaba Juhász
>Assignee: Csaba Juhász
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-22-19-11-35-890.png
>
>
> generateJulianLeapYearTimestamps and generateJulianLeapYearTimestamps28thFeb 
> throw a NegativeArraySizeException once the base value is 999 or greater.
> This is caused by the code below, which supplies a negative value (when digits 
> returns a value larger than 4) to zeros, which in turn is used as the size of 
> a new char array.
> {code:java}
> StringBuilder sb = new StringBuilder(29);
> int year = ((i % ) + 1) * 100;
> sb.append(zeros(4 - digits(year)));
> {code}
> When the tests are run using Maven, the error in the generation function is 
> caught but never rethrown or reported, and the build is reported as 
> successful. For example, running
> _TestParquetTimestampsHive2Compatibility#testWriteHive2ReadHive4UsingLegacyConversionWithJulianLeapYearsFor28thFeb_
>  has the result:
> {code:java}
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 0.723 s - in 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0
> ...
> [INFO] BUILD SUCCESS
> {code}
> When the test is run through an IDE (eg VSCode), the failure is reported 
> properly.
>  !image-2024-05-22-19-11-35-890.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28273) Test data generation failure in HIVE-28249 related tests

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28273:
--
Labels: pull-request-available  (was: )

> Test data generation failure in HIVE-28249 related tests
> 
>
> Key: HIVE-28273
> URL: https://issues.apache.org/jira/browse/HIVE-28273
> Project: Hive
>  Issue Type: Bug
>Reporter: Csaba Juhász
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-05-22-19-11-35-890.png
>
>
> generateJulianLeapYearTimestamps and generateJulianLeapYearTimestamps28thFeb 
> throw a NegativeArraySizeException once the base value is 999 or greater.
> This is caused by the code below, which supplies a negative value (when digits 
> returns a value larger than 4) to zeros, which in turn is used as the size of 
> a new char array.
> {code:java}
> StringBuilder sb = new StringBuilder(29);
> int year = ((i % ) + 1) * 100;
> sb.append(zeros(4 - digits(year)));
> {code}
> When the tests are run using Maven, the error in the generation function is 
> caught but never rethrown or reported, and the build is reported as 
> successful. For example, running
> _TestParquetTimestampsHive2Compatibility#testWriteHive2ReadHive4UsingLegacyConversionWithJulianLeapYearsFor28thFeb_
>  has the result:
> {code:java}
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
> 0.723 s - in 
> org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0
> ...
> [INFO] BUILD SUCCESS
> {code}
> When the test is run through an IDE (eg VSCode), the failure is reported 
> properly.
>  !image-2024-05-22-19-11-35-890.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28270) Fix missing partition paths bug on drop_database

2024-05-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848691#comment-17848691
 ] 

Ayush Saxena commented on HIVE-28270:
-

Committed to master.

Thanx [~wechar] for the contribution!!!

> Fix missing partition paths  bug on drop_database
> -
>
> Key: HIVE-28270
> URL: https://issues.apache.org/jira/browse/HIVE-28270
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> In {{HMSHandler#drop_database_core}}, it needs to collect all partition paths 
> that are not under the table path, but currently it only fetches the last 
> batch of paths.
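The bug pattern described, keeping only the final batch, typically comes from reassigning the result list inside the batch loop instead of appending to it. A hedged sketch of the shape of the bug and the fix (hypothetical names; this is not the actual HMSHandler code):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPaths {
    // Buggy shape: each batch overwrites the result collected so far,
    // so only the last batch of partition paths survives the loop.
    static List<String> collectBuggy(List<List<String>> batches) {
        List<String> partPaths = new ArrayList<>();
        for (List<String> batch : batches) {
            partPaths = new ArrayList<>(batch);  // overwrites earlier batches
        }
        return partPaths;
    }

    // Fixed shape: accumulate every batch into one list.
    static List<String> collectFixed(List<List<String>> batches) {
        List<String> partPaths = new ArrayList<>();
        for (List<String> batch : batches) {
            partPaths.addAll(batch);  // keeps all batches
        }
        return partPaths;
    }

    public static void main(String[] args) {
        List<List<String>> batches =
            List.of(List.of("/p=1", "/p=2"), List.of("/p=3"));
        System.out.println(collectBuggy(batches));  // [/p=3]
        System.out.println(collectFixed(batches));  // [/p=1, /p=2, /p=3]
    }
}
```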



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28270) Fix missing partition paths bug on drop_database

2024-05-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HIVE-28270:

Labels: hive-4.0.1-must pull-request-available  (was: 
pull-request-available)

> Fix missing partition paths  bug on drop_database
> -
>
> Key: HIVE-28270
> URL: https://issues.apache.org/jira/browse/HIVE-28270
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: hive-4.0.1-must, pull-request-available
> Fix For: 4.1.0
>
>
> In {{HMSHandler#drop_database_core}}, it needs to collect all partition paths 
> that are not under the table path, but currently it only fetches the last 
> batch of paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-28270) Fix missing partition paths bug on drop_database

2024-05-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HIVE-28270.
-
Fix Version/s: 4.1.0
   Resolution: Fixed

> Fix missing partition paths  bug on drop_database
> -
>
> Key: HIVE-28270
> URL: https://issues.apache.org/jira/browse/HIVE-28270
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> In {{HMSHandler#drop_database_core}}, it needs to collect all partition paths 
> that are not under the table path, but currently it only fetches the last 
> batch of paths.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28271) DirectSql fails for AlterPartitions

2024-05-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HIVE-28271:

Labels: hive-4.0.1-must pull-request-available  (was: 
pull-request-available)

> DirectSql fails for AlterPartitions
> ---
>
> Key: HIVE-28271
> URL: https://issues.apache.org/jira/browse/HIVE-28271
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: hive-4.0.1-must, pull-request-available
> Fix For: 4.1.0
>
>
> It fails in three places: two where it mishandles databases that store values 
> as CLOB, and one where a Boolean type conversion check is missing.
> *First:*
> {noformat}
> 2024-05-21T08:50:16,570  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.getParams(DirectSqlUpdatePart.java:748)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateParamTableInBatch(DirectSqlUpdatePart.java:715)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:636)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}
> *Second:*
> {noformat}
> 2024-05-21T09:14:36,808  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateCDInBatch(DirectSqlUpdatePart.java:1228)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:888)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);{noformat}
> *Third: Missing Boolean check type*
> {noformat}
> 2024-05-21T09:35:44,063  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.sql.BatchUpdateException: A truncation error was encountered trying to 
> shrink CHAR 'false' to length 1. at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.lambda$updateSDInBatch$16(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateWithStatement(DirectSqlUpdatePart.java:656)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateSDInBatch(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:900)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}
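The third failure suggests the direct-SQL path binds the Java boolean's string form ("false", five characters) into a CHAR(1) column, which Derby rejects with the truncation error shown. A common remedy is to map the boolean to the single-character flag the schema expects before binding; a hedged sketch (the helper name and the setString usage in the comments are hypothetical, not the actual DirectSqlUpdatePart code):

```java
public class BooleanFlag {
    // CHAR(1) columns cannot hold the 5-character literal "false",
    // so convert the boolean to a one-character flag first.
    static String toFlag(boolean value) {
        return value ? "Y" : "N";
    }

    public static void main(String[] args) {
        // e.g. ps.setString(i, toFlag(isCompressed)) instead of
        //      ps.setString(i, String.valueOf(isCompressed))
        System.out.println(toFlag(false));  // N
        System.out.println(toFlag(true));   // Y
    }
}
```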



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-28271) DirectSql fails for AlterPartitions

2024-05-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HIVE-28271.
-
Fix Version/s: 4.1.0
   Resolution: Fixed

> DirectSql fails for AlterPartitions
> ---
>
> Key: HIVE-28271
> URL: https://issues.apache.org/jira/browse/HIVE-28271
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> It fails in three places: two where it mishandles databases that store values 
> as CLOB, and one where a Boolean type conversion check is missing.
> *First:*
> {noformat}
> 2024-05-21T08:50:16,570  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.getParams(DirectSqlUpdatePart.java:748)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateParamTableInBatch(DirectSqlUpdatePart.java:715)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:636)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}
> *Second:*
> {noformat}
> 2024-05-21T09:14:36,808  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateCDInBatch(DirectSqlUpdatePart.java:1228)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:888)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);{noformat}
> *Third: Missing Boolean check type*
> {noformat}
> 2024-05-21T09:35:44,063  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.sql.BatchUpdateException: A truncation error was encountered trying to 
> shrink CHAR 'false' to length 1. at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.lambda$updateSDInBatch$16(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateWithStatement(DirectSqlUpdatePart.java:656)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateSDInBatch(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:900)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28271) DirectSql fails for AlterPartitions

2024-05-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848690#comment-17848690
 ] 

Ayush Saxena commented on HIVE-28271:
-

Committed to master.

Thanx [~zhangbutao] & [~wechar] for the review!!

> DirectSql fails for AlterPartitions
> ---
>
> Key: HIVE-28271
> URL: https://issues.apache.org/jira/browse/HIVE-28271
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>
> It fails in three places: two where it mishandles databases that store values 
> as CLOB, and one where a Boolean type conversion check is missing.
> *First:*
> {noformat}
> 2024-05-21T08:50:16,570  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.getParams(DirectSqlUpdatePart.java:748)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateParamTableInBatch(DirectSqlUpdatePart.java:715)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:636)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}
> *Second:*
> {noformat}
> 2024-05-21T09:14:36,808  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateCDInBatch(DirectSqlUpdatePart.java:1228)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:888)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);{noformat}
> *Third: Missing Boolean check type*
> {noformat}
> 2024-05-21T09:35:44,063  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.sql.BatchUpdateException: A truncation error was encountered trying to 
> shrink CHAR 'false' to length 1. at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.lambda$updateSDInBatch$16(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateWithStatement(DirectSqlUpdatePart.java:656)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateSDInBatch(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:900)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28273) Test data generation failure in HIVE-28249 related tests

2024-05-22 Thread Jira
Csaba Juhász created HIVE-28273:
---

 Summary: Test data generation failure in HIVE-28249 related tests
 Key: HIVE-28273
 URL: https://issues.apache.org/jira/browse/HIVE-28273
 Project: Hive
  Issue Type: Bug
Reporter: Csaba Juhász
 Attachments: image-2024-05-22-19-11-35-890.png

generateJulianLeapYearTimestamps and generateJulianLeapYearTimestamps28thFeb 
throw a NegativeArraySizeException once the base value is 999 or greater.

This is caused by the code below, which supplies a negative value to zeros 
(when digits returns a value larger than 4); that value is in turn used to 
create a new char array.

{code:java}
StringBuilder sb = new StringBuilder(29);
int year = ((i % ) + 1) * 100;
sb.append(zeros(4 - digits(year)));
{code}
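The overflow can be reproduced in isolation. Below is a minimal sketch; the helper names follow the snippet above, but their bodies are assumptions, not the actual Hive implementations:

```java
// Hypothetical reconstruction of the helpers referenced in the snippet above;
// the real methods in TestParquetTimestampsHive2Compatibility may differ.
public class ZerosDemo {
    // number of decimal digits in a positive int (assumed behavior of digits)
    static int digits(int n) {
        return String.valueOf(n).length();
    }

    // a run of '0' characters; new char[count] throws
    // NegativeArraySizeException when count < 0
    static char[] zeros(int count) {
        char[] arr = new char[count];
        java.util.Arrays.fill(arr, '0');
        return arr;
    }

    public static void main(String[] args) {
        // year <= 9999: 4 - digits(year) >= 0, padding works
        System.out.println(new String(zeros(4 - digits(800))));
        // year >= 10000: digits(year) == 5, so zeros(-1) blows up
        try {
            zeros(4 - digits(10000));
        } catch (NegativeArraySizeException e) {
            System.out.println("caught NegativeArraySizeException");
        }
    }
}
```

This matches the reported behavior: once the computed year exceeds four digits, `4 - digits(year)` goes negative and the char-array allocation fails.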

When the tests are run using Maven, the error in the generation function is 
caught but never rethrown or reported, and the build is reported as successful. 
For example, running 
_TestParquetTimestampsHive2Compatibility#testWriteHive2ReadHive4UsingLegacyConversionWithJulianLeapYearsFor28thFeb_
 gives the result:


{code:java}
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running 
org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
[INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.723 
s - in 
org.apache.hadoop.hive.ql.io.parquet.serde.TestParquetTimestampsHive2Compatibility
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 396, Failures: 0, Errors: 0, Skipped: 0

...

[INFO] BUILD SUCCESS
{code}

When the test is run through an IDE (e.g. VS Code), the failure is reported 
properly.

 !image-2024-05-22-19-11-35-890.png! 




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-28246) Fix confusing log message in LlapTaskSchedulerService

2024-05-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HIVE-28246.
-
Fix Version/s: 4.1.0
   Resolution: Fixed

> Fix confusing log message in LlapTaskSchedulerService
> -
>
> Key: HIVE-28246
> URL: https://issues.apache.org/jira/browse/HIVE-28246
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Zoltán Rátkai
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 4.1.0
>
>
> https://github.com/apache/hive/blob/8415527101432bb5bf14b3c2a318a2cc40801b9a/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java#L1719
> {code}
>   WM_LOG.info("Registering " + taskInfo.attemptId + "; " + 
> taskInfo.isGuaranteed);
> {code}
> leads to a message like:
> {code}
> Registering attempt_1714730410273_0009_153_05_000235_10; false
> {code}
> "false" lacks context; it should be something like:
> {code}
> Registering attempt_1714730410273_0009_153_05_000235_10, guaranteed: false
> {code}
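The fix is purely about labeling the flag. A standalone illustration of the before/after difference, using plain `System.out` as a stand-in for WM_LOG:

```java
// Illustrates why the unlabeled boolean is confusing; not Hive's actual logger setup.
public class LogMessageDemo {
    public static void main(String[] args) {
        String attemptId = "attempt_1714730410273_0009_153_05_000235_10";
        boolean guaranteed = false;
        // before: the trailing "false" has no context
        System.out.println("Registering " + attemptId + "; " + guaranteed);
        // after: the label makes clear what the flag means
        System.out.println("Registering " + attemptId + ", guaranteed: " + guaranteed);
    }
}
```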



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28246) Fix confusing log message in LlapTaskSchedulerService

2024-05-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848645#comment-17848645
 ] 

Ayush Saxena commented on HIVE-28246:
-

Committed to master.

Thanx [~zratkai] for the contribution & [~aturoczy] for the review!!!

> Fix confusing log message in LlapTaskSchedulerService
> -
>
> Key: HIVE-28246
> URL: https://issues.apache.org/jira/browse/HIVE-28246
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Zoltán Rátkai
>Priority: Major
>  Labels: newbie, pull-request-available
>
> https://github.com/apache/hive/blob/8415527101432bb5bf14b3c2a318a2cc40801b9a/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java#L1719
> {code}
>   WM_LOG.info("Registering " + taskInfo.attemptId + "; " + 
> taskInfo.isGuaranteed);
> {code}
> leads to a message like:
> {code}
> Registering attempt_1714730410273_0009_153_05_000235_10; false
> {code}
> "false" lacks context; it should be something like:
> {code}
> Registering attempt_1714730410273_0009_153_05_000235_10, guaranteed: false
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28246) Fix confusing log message in LlapTaskSchedulerService

2024-05-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HIVE-28246:

Summary: Fix confusing log message in LlapTaskSchedulerService  (was: 
Confusing log messages in LlapTaskScheduler)

> Fix confusing log message in LlapTaskSchedulerService
> -
>
> Key: HIVE-28246
> URL: https://issues.apache.org/jira/browse/HIVE-28246
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Zoltán Rátkai
>Priority: Major
>  Labels: newbie, pull-request-available
>
> https://github.com/apache/hive/blob/8415527101432bb5bf14b3c2a318a2cc40801b9a/llap-tez/src/java/org/apache/hadoop/hive/llap/tezplugins/LlapTaskSchedulerService.java#L1719
> {code}
>   WM_LOG.info("Registering " + taskInfo.attemptId + "; " + 
> taskInfo.isGuaranteed);
> {code}
> leads to a message like:
> {code}
> Registering attempt_1714730410273_0009_153_05_000235_10; false
> {code}
> "false" lacks context; it should be something like:
> {code}
> Registering attempt_1714730410273_0009_153_05_000235_10, guaranteed: false
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-25974) Drop HiveFilterMergeRule and use FilterMergeRule from Calcite

2024-05-22 Thread Stamatis Zampetakis (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis resolved HIVE-25974.

Fix Version/s: Not Applicable
   Resolution: Duplicate

> Drop HiveFilterMergeRule and use FilterMergeRule from Calcite
> -
>
> Key: HIVE-25974
> URL: https://issues.apache.org/jira/browse/HIVE-25974
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO
>Affects Versions: 4.0.0
>Reporter: Alessandro Solimando
>Priority: Major
> Fix For: Not Applicable
>
>
> HiveFilterMergeRule is a copy of FilterMergeRule which was needed since the 
> latter did not simplify/flatten before creating the merged filter.
> This behaviour has been fixed in CALCITE-3982 (released since 1.23), so it 
> seems that the Hive rule could be removed now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HIVE-22633) GroupByOperator may throw NullPointerException when setting data skew optimization parameters

2024-05-22 Thread Butao Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848545#comment-17848545
 ] 

Butao Zhang edited comment on HIVE-22633 at 5/22/24 10:46 AM:
--

Update: if you are using Hive 3, you can try to patch HIVE-27712 as a hotfix, 
which is easier to test than HIVE-23530.

Hive 4 does not have this issue, as we don't use this UDAF since HIVE-23530.


was (Author: zhangbutao):
Update: if you are using Hive3, you can try to patch this HIVE-27712 as a 
hotfix, which is easier to test than HIVE-23530.

> GroupByOperator may throw NullPointerException when setting data skew 
> optimization parameters
> -
>
> Key: HIVE-22633
> URL: https://issues.apache.org/jira/browse/HIVE-22633
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1, 4.0.0
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>
> If hive.map.aggr and hive.groupby.skewindata are set to true, an exception 
> will be thrown.
> Steps to repro:
> 1. create table: 
> set hive.map.aggr=true;
> set hive.groupby.skewindata=true;
> create table test1 (id1 bigint);
> create table test2 (id2 bigint) partitioned by(dt2 string);
> insert into test2 partition(dt2='2020') select a.id1 from test1 a group by 
> a.id1;
> 2.NullPointerException:
> {code:java}
> ], TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) : 
> attempt_1585641455670_0001_2_03_00_2:java.lang.RuntimeException: 
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
> at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFComputeStats$GenericUDAFNumericStatsEvaluator.init(GenericUDAFComputeStats.java:373)
> at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:373)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:360)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:191)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-22633) GroupByOperator may throw NullPointerException when setting data skew optimization parameters

2024-05-22 Thread Butao Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848545#comment-17848545
 ] 

Butao Zhang commented on HIVE-22633:


Update: if you are using Hive 3, you can try to patch HIVE-27712 as a hotfix, 
which is easier to test than HIVE-23530.

> GroupByOperator may throw NullPointerException when setting data skew 
> optimization parameters
> -
>
> Key: HIVE-22633
> URL: https://issues.apache.org/jira/browse/HIVE-22633
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.1.1, 4.0.0
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>
> If hive.map.aggr and hive.groupby.skewindata are set to true, an exception 
> will be thrown.
> Steps to repro:
> 1. create table: 
> set hive.map.aggr=true;
> set hive.groupby.skewindata=true;
> create table test1 (id1 bigint);
> create table test2 (id2 bigint) partitioned by(dt2 string);
> insert into test2 partition(dt2='2020') select a.id1 from test1 a group by 
> a.id1;
> 2.NullPointerException:
> {code:java}
> ], TaskAttempt 2 failed, info=[Error: Error while running task ( failure ) : 
> attempt_1585641455670_0001_2_03_00_2:java.lang.RuntimeException: 
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
> at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
> at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFComputeStats$GenericUDAFNumericStatsEvaluator.init(GenericUDAFComputeStats.java:373)
> at 
> org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:373)
> at 
> org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:360)
> at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:191)
> at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28272) Support setting per-session S3 credentials in Warehouse

2024-05-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28272:
--
Labels: pull-request-available  (was: )

> Support setting per-session S3 credentials in Warehouse
> ---
>
> Key: HIVE-28272
> URL: https://issues.apache.org/jira/browse/HIVE-28272
> Project: Hive
>  Issue Type: Improvement
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-28272) Support setting per-session S3 credentials in Warehouse

2024-05-22 Thread Butao Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Butao Zhang reassigned HIVE-28272:
--

Assignee: Butao Zhang

> Support setting per-session S3 credentials in Warehouse
> ---
>
> Key: HIVE-28272
> URL: https://issues.apache.org/jira/browse/HIVE-28272
> Project: Hive
>  Issue Type: Improvement
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-28268) Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false

2024-05-22 Thread Butao Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Butao Zhang reassigned HIVE-28268:
--

Assignee: Butao Zhang

> Iceberg: Retrieve row count from iceberg SnapshotSummary in case of 
> iceberg.hive.keep.stats=false
> -
>
> Key: HIVE-28268
> URL: https://issues.apache.org/jira/browse/HIVE-28268
> Project: Hive
>  Issue Type: Task
>  Components: Iceberg integration
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28272) Support setting per-session S3 credentials in Warehouse

2024-05-22 Thread Butao Zhang (Jira)
Butao Zhang created HIVE-28272:
--

 Summary: Support setting per-session S3 credentials in Warehouse
 Key: HIVE-28272
 URL: https://issues.apache.org/jira/browse/HIVE-28272
 Project: Hive
  Issue Type: Improvement
Reporter: Butao Zhang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-25351) stddev(), stddev_pop() with CBO enable returning null

2024-05-22 Thread Dayakar M (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848524#comment-17848524
 ] 

Dayakar M commented on HIVE-25351:
--

[~yangjiandan] Currently I am not working on this issue; if you have a solution 
ready, you can take it over and fix it. Thanks.

> stddev(), stddev_pop() with CBO enable returning null
> -
>
> Key: HIVE-25351
> URL: https://issues.apache.org/jira/browse/HIVE-25351
> Project: Hive
>  Issue Type: Bug
>Reporter: Ashish Sharma
>Assignee: Dayakar M
>Priority: Blocker
>  Labels: pull-request-available
>
> *script used to repro*
> create table cbo_test (key string, v1 double, v2 decimal(30,2), v3 
> decimal(30,2));
> insert into cbo_test values ("00140006375905", 10230.72, 
> 10230.72, 10230.69), ("00140006375905", 10230.72, 10230.72, 
> 10230.69), ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69);
> select stddev(v1), stddev(v2), stddev(v3) from cbo_test;
> *Enable CBO*
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)|
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2 vectorized |
> |   File Output Operator [FS_13] |
> | Select Operator [SEL_12] (rows=1 width=24) |
> |   Output:["_col0","_col1","_col2"] |
> |   Group By Operator [GBY_11] (rows=1 width=72) |
> | 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(VALUE._col0)","sum(VALUE._col1)","count(VALUE._col2)","sum(VALUE._col3)","sum(VALUE._col4)","count(VALUE._col5)","sum(VALUE._col6)","sum(VALUE._col7)","count(VALUE._col8)"]
>  |
> |   <-Map 1 [CUSTOM_SIMPLE_EDGE] vectorized  |
> | PARTITION_ONLY_SHUFFLE [RS_10] |
> |   Group By Operator [GBY_9] (rows=1 width=72) |
> | 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(_col3)","sum(_col0)","count(_col0)","sum(_col5)","sum(_col4)","count(_col1)","sum(_col7)","sum(_col6)","count(_col2)"]
>  |
> | Select Operator [SEL_8] (rows=6 width=232) |
> |   
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"] |
> |   TableScan [TS_0] (rows=6 width=232) |
> | default@cbo_test,cbo_test, ACID 
> table,Tbl:COMPLETE,Col:COMPLETE,Output:["v1","v2","v3"] |
> ||
> ++
> *Query Result* 
> _c0   _c1 _c2
> 0.0   NaN NaN
> *Disable CBO*
> ++
> |  Explain   |
> ++
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)|
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1 

[jira] [Resolved] (HIVE-28266) Iceberg: select count(*) from data_files metadata tables gives wrong result

2024-05-22 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko resolved HIVE-28266.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

> Iceberg: select count(*) from data_files metadata tables gives wrong result
> ---
>
> Key: HIVE-28266
> URL: https://issues.apache.org/jira/browse/HIVE-28266
> Project: Hive
>  Issue Type: Bug
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> In Hive Iceberg, every table has a corresponding metadata table 
> "*.data_files" that contains info about the files that hold the table's data.
> select count(*) from a data_files metadata table returns the number of rows 
> in the data table instead of the number of data files from the metadata table.
>  
> {code:java}
> CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
> iceberg stored as orc TBLPROPERTIES 
> ('external.table.purge'='true','format-version'='2');
> insert into x values 
> ('amy', 35, 123412344),
> ('adxfvy', 36, 123412534),
> ('amsdfyy', 37, 123417234),
> ('asafmy', 38, 123412534);
> insert into x values 
> ('amerqwy', 39, 123441234),
> ('amyxzcv', 40, 123341234),
> ('erweramy', 45, 122341234);
> Select * from default.x.data_files;
> -- Returns 2 records in the output
> Select count(*) from default.x.data_files;
> -- Returns 7 instead of 2
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28266) Iceberg: select count(*) from data_files metadata tables gives wrong result

2024-05-22 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28266:
--
Affects Version/s: 4.0.0

> Iceberg: select count(*) from data_files metadata tables gives wrong result
> ---
>
> Key: HIVE-28266
> URL: https://issues.apache.org/jira/browse/HIVE-28266
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> In Hive Iceberg, every table has a corresponding metadata table 
> "*.data_files" that contains info about the files that hold the table's data.
> select count(*) from a data_files metadata table returns the number of rows 
> in the data table instead of the number of data files from the metadata table.
>  
> {code:java}
> CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
> iceberg stored as orc TBLPROPERTIES 
> ('external.table.purge'='true','format-version'='2');
> insert into x values 
> ('amy', 35, 123412344),
> ('adxfvy', 36, 123412534),
> ('amsdfyy', 37, 123417234),
> ('asafmy', 38, 123412534);
> insert into x values 
> ('amerqwy', 39, 123441234),
> ('amyxzcv', 40, 123341234),
> ('erweramy', 45, 122341234);
> Select * from default.x.data_files;
> -- Returns 2 records in the output
> Select count(*) from default.x.data_files;
> -- Returns 7 instead of 2
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28266) Iceberg: select count(*) from data_files metadata tables gives wrong result

2024-05-22 Thread Denys Kuzmenko (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848501#comment-17848501
 ] 

Denys Kuzmenko commented on HIVE-28266:
---

Merged to master
Thanks [~difin] for the patch and [~zhangbutao] for the review!

> Iceberg: select count(*) from data_files metadata tables gives wrong result
> ---
>
> Key: HIVE-28266
> URL: https://issues.apache.org/jira/browse/HIVE-28266
> Project: Hive
>  Issue Type: Bug
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>  Labels: pull-request-available
>
> In Hive Iceberg, every table has a corresponding metadata table 
> "*.data_files" that contains info about the files that hold the table's data.
> select count(*) from a data_files metadata table returns the number of rows 
> in the data table instead of the number of data files from the metadata table.
>  
> {code:java}
> CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
> iceberg stored as orc TBLPROPERTIES 
> ('external.table.purge'='true','format-version'='2');
> insert into x values 
> ('amy', 35, 123412344),
> ('adxfvy', 36, 123412534),
> ('amsdfyy', 37, 123417234),
> ('asafmy', 38, 123412534);
> insert into x values 
> ('amerqwy', 39, 123441234),
> ('amyxzcv', 40, 123341234),
> ('erweramy', 45, 122341234);
> Select * from default.x.data_files;
> -- Returns 2 records in the output
> Select count(*) from default.x.data_files;
> -- Returns 7 instead of 2
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-25351) stddev(), stddev_pop() with CBO enable returning null

2024-05-22 Thread Jiandan Yang (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848487#comment-17848487
 ] 

Jiandan Yang  commented on HIVE-25351:
--

[~Dayakar] I encountered the same issue in Hive version 3.1.3, and from 
reviewing the code, it appears that the current master branch would have the 
same issue. I have fixed this problem in version 3.1.3. If no one is addressing 
this issue, I am prepared to take it over and resolve it.

> stddev(), stddev_pop() with CBO enable returning null
> -
>
> Key: HIVE-25351
> URL: https://issues.apache.org/jira/browse/HIVE-25351
> Project: Hive
>  Issue Type: Bug
>Reporter: Ashish Sharma
>Assignee: Dayakar M
>Priority: Blocker
>  Labels: pull-request-available
>
> *script used to repro*
> create table cbo_test (key string, v1 double, v2 decimal(30,2), v3 
> decimal(30,2));
> insert into cbo_test values ("00140006375905", 10230.72, 
> 10230.72, 10230.69), ("00140006375905", 10230.72, 10230.72, 
> 10230.69), ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69), 
> ("00140006375905", 10230.72, 10230.72, 10230.69);
> select stddev(v1), stddev(v2), stddev(v3) from cbo_test;
> *Enable CBO*
> ++
> |  Explain   |
> ++
> | Plan optimized by CBO. |
> ||
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)|
> ||
> | Stage-0|
> |   Fetch Operator   |
> | limit:-1   |
> | Stage-1|
> |   Reducer 2 vectorized |
> |   File Output Operator [FS_13] |
> | Select Operator [SEL_12] (rows=1 width=24) |
> |   Output:["_col0","_col1","_col2"] |
> |   Group By Operator [GBY_11] (rows=1 width=72) |
> | 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(VALUE._col0)","sum(VALUE._col1)","count(VALUE._col2)","sum(VALUE._col3)","sum(VALUE._col4)","count(VALUE._col5)","sum(VALUE._col6)","sum(VALUE._col7)","count(VALUE._col8)"]
>  |
> |   <-Map 1 [CUSTOM_SIMPLE_EDGE] vectorized  |
> | PARTITION_ONLY_SHUFFLE [RS_10] |
> |   Group By Operator [GBY_9] (rows=1 width=72) |
> | 
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(_col3)","sum(_col0)","count(_col0)","sum(_col5)","sum(_col4)","count(_col1)","sum(_col7)","sum(_col6)","count(_col2)"]
>  |
> | Select Operator [SEL_8] (rows=6 width=232) |
> |   
> Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"] |
> |   TableScan [TS_0] (rows=6 width=232) |
> | default@cbo_test,cbo_test, ACID 
> table,Tbl:COMPLETE,Col:COMPLETE,Output:["v1","v2","v3"] |
> ||
> ++
> *Query Result* 
> _c0   _c1 _c2
> 0.0   NaN NaN
> *Disable CBO*
> ++
> |  Explain   |
> ++
> | Vertex dependency in root stage|
> | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)|
> ||
> | Stage-0  

[jira] [Commented] (HIVE-28258) Use Iceberg semantics for Merge task

2024-05-22 Thread Sourabh Badhya (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848458#comment-17848458
 ] 

Sourabh Badhya commented on HIVE-28258:
---

[~kkasa], this task mainly reuses the existing Iceberg readers 
(IcebergRecordReader) rather than using file-format readers chosen according 
to the table format. This way we can use the existing code for handling 
different file formats (ORC, Parquet, Avro) within Iceberg and avoid writing 
any custom implementations for these file formats.

Additionally, this will help in handling the different schemas that Iceberg 
maintains (the data schema and the delete schema) without exposing them 
through public APIs.

Earlier custom hacks, like changing the file format of the merge task, are 
also removed.

The existing test iceberg_merge_files.q should serve as an example for 
debugging the merge task used for Iceberg.

> Use Iceberg semantics for Merge task
> 
>
> Key: HIVE-28258
> URL: https://issues.apache.org/jira/browse/HIVE-28258
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> Use Iceberg semantics for Merge task, instead of normal ORC or parquet 
> readers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-24207) LimitOperator can leverage ObjectCache to bail out quickly

2024-05-21 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848432#comment-17848432
 ] 

Sungwoo Park commented on HIVE-24207:
-

[~abstractdog] Hi, I have a couple of questions about this optimization.

1. An operator tree can contain multiple LimitOperators in general. It seems 
that this optimization works only if the LimitOperator has a single child 
operator, which should be either an RS or a TerminalOperator. In other words, 
a vertex should contain at most a single LimitOperator, and it should be the 
last operator before emitting final records. Do you know if this property is 
guaranteed by the Hive compiler?

2. This optimization may not work if speculative execution is enabled or 
multiple task attempts are executed in the same LLAP daemon. Or does this 
optimization assume no speculative execution?






> LimitOperator can leverage ObjectCache to bail out quickly
> --
>
> Key: HIVE-24207
> URL: https://issues.apache.org/jira/browse/HIVE-24207
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {noformat}
> select  ss_sold_date_sk from store_sales, date_dim where date_dim.d_year in 
> (1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = date_dim.d_date_sk 
> limit 100;
>  select distinct ss_sold_date_sk from store_sales, date_dim where 
> date_dim.d_year in (1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = 
> date_dim.d_date_sk limit 100;
>  {noformat}
> Queries like the above generate a large number of map tasks. Currently they 
> don't bail out after generating enough amount of data. 
> It would be good to make use of ObjectCache & retain the number of records 
> generated. LimitOperator/VectorLimitOperator can bail out for the later tasks 
> in the operator's init phase itself. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorLimitOperator.java#L57
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/LimitOperator.java#L58
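The bail-out idea in the description can be sketched with a counter shared across tasks in the same process. This is a simplified stand-in, not Hive's actual ObjectCache API; the cache key and limit handling are assumptions:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: tasks of the same vertex in one daemon share a produced-row counter,
// so later tasks can bail out in their init phase once the limit is reached.
public class LimitBailOutSketch {
    // stand-in for the per-daemon ObjectCache
    static final ConcurrentHashMap<String, AtomicInteger> CACHE = new ConcurrentHashMap<>();

    final String vertexKey;
    final int limit;
    boolean done;

    LimitBailOutSketch(String vertexKey, int limit) {
        this.vertexKey = vertexKey;
        this.limit = limit;
        // init-phase check: if earlier tasks already produced `limit` rows, bail out now
        AtomicInteger produced = CACHE.computeIfAbsent(vertexKey, k -> new AtomicInteger());
        this.done = produced.get() >= limit;
    }

    // returns false once the shared limit is exhausted
    boolean process() {
        if (done) return false;
        AtomicInteger produced = CACHE.get(vertexKey);
        if (produced.incrementAndGet() > limit) {
            done = true;
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        LimitBailOutSketch t1 = new LimitBailOutSketch("v1", 3);
        for (int i = 0; i < 5; i++) t1.process();   // produces 3 rows, then stops
        LimitBailOutSketch t2 = new LimitBailOutSketch("v1", 3);
        System.out.println(t2.done);                // a later task bails out in init
    }
}
```

Note this sketch also shows why the questions above matter: the shared counter is only correct if all attempts writing to it contribute to the same final output, which speculative or duplicate attempts would violate.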



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28269) Please have regular releases of hive and its docker image

2024-05-21 Thread Raviteja Lokineni (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848323#comment-17848323
 ] 

Raviteja Lokineni commented on HIVE-28269:
--

[~ayushtkn] May I ask if there can be a faster release cycle?

I’ll try to link all the security tickets here on this one.

> Please have regular releases of hive and its docker image
> -
>
> Key: HIVE-28269
> URL: https://issues.apache.org/jira/browse/HIVE-28269
> Project: Hive
>  Issue Type: Wish
>Reporter: Raviteja Lokineni
>Priority: Major
>
> Hi, we as a company are users of the Hive metastore and use the Docker 
> images. The latest Docker image, 4.0.0, has a lot of vulnerabilities. I see 
> most of them are patched in the mainline code, but a release has not been 
> made available.
> Can we/I help in any way to have regular releases, at the very least for the 
> security patches? If not us, then this is a request to the Hive maintainers 
> to have regular releases.





[jira] [Updated] (HIVE-28269) Please have regular releases of hive and its docker image

2024-05-21 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28269:
--
Priority: Major  (was: Blocker)

> Please have regular releases of hive and its docker image
> -
>
> Key: HIVE-28269
> URL: https://issues.apache.org/jira/browse/HIVE-28269
> Project: Hive
>  Issue Type: Task
>Reporter: Raviteja Lokineni
>Priority: Major
>
> Hi, we as a company are users of the Hive metastore and use the Docker 
> images. The latest Docker image, 4.0.0, has a lot of vulnerabilities. I see 
> most of them are patched in the mainline code, but a release has not been 
> made available.
> Can we/I help in any way to have regular releases, at the very least for the 
> security patches? If not us, then this is a request to the Hive maintainers 
> to have regular releases.





[jira] [Updated] (HIVE-28269) Please have regular releases of hive and its docker image

2024-05-21 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28269:
--
Issue Type: Wish  (was: Task)

> Please have regular releases of hive and its docker image
> -
>
> Key: HIVE-28269
> URL: https://issues.apache.org/jira/browse/HIVE-28269
> Project: Hive
>  Issue Type: Wish
>Reporter: Raviteja Lokineni
>Priority: Major
>
> Hi, we as a company are users of the Hive metastore and use the Docker 
> images. The latest Docker image, 4.0.0, has a lot of vulnerabilities. I see 
> most of them are patched in the mainline code, but a release has not been 
> made available.
> Can we/I help in any way to have regular releases, at the very least for the 
> security patches? If not us, then this is a request to the Hive maintainers 
> to have regular releases.





[jira] [Updated] (HIVE-28271) DirectSql fails for AlterPartitions

2024-05-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28271:
--
Labels: pull-request-available  (was: )

> DirectSql fails for AlterPartitions
> ---
>
> Key: HIVE-28271
> URL: https://issues.apache.org/jira/browse/HIVE-28271
> Project: Hive
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>
> It fails in three places: it mishandles databases which use CLOB, and it is 
> missing Boolean type conversion checks.
> *First:*
> {noformat}
> 2024-05-21T08:50:16,570  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.getParams(DirectSqlUpdatePart.java:748)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateParamTableInBatch(DirectSqlUpdatePart.java:715)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:636)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}
> *Second:*
> {noformat}
> 2024-05-21T09:14:36,808  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
> cast to java.lang.String at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateCDInBatch(DirectSqlUpdatePart.java:1228)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:888)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);{noformat}
> *Third: Missing Boolean type check*
> {noformat}
> 2024-05-21T09:35:44,063  WARN [main] metastore.ObjectStore: Falling back to 
> ORM path due to direct SQL failure (this is not an error): 
> java.sql.BatchUpdateException: A truncation error was encountered trying to 
> shrink CHAR 'false' to length 1. at 
> org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
>  at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) 
> at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.lambda$updateSDInBatch$16(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateWithStatement(DirectSqlUpdatePart.java:656)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateSDInBatch(DirectSqlUpdatePart.java:926)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:900)
>  at 
> org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
>  at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
>  at 
> org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
> {noformat}





[jira] [Created] (HIVE-28271) DirectSql fails for AlterPartitions

2024-05-21 Thread Ayush Saxena (Jira)
Ayush Saxena created HIVE-28271:
---

 Summary: DirectSql fails for AlterPartitions
 Key: HIVE-28271
 URL: https://issues.apache.org/jira/browse/HIVE-28271
 Project: Hive
  Issue Type: Bug
Reporter: Ayush Saxena
Assignee: Ayush Saxena


It fails in three places: it mishandles databases which use CLOB, and it is 
missing Boolean type conversion checks.

*First:*
{noformat}
2024-05-21T08:50:16,570  WARN [main] metastore.ObjectStore: Falling back to ORM 
path due to direct SQL failure (this is not an error): 
java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
cast to java.lang.String at 
org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
 at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.getParams(DirectSqlUpdatePart.java:748)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateParamTableInBatch(DirectSqlUpdatePart.java:715)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:636)
 at 
org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
 at 
org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);

{noformat}
*Second:*
{noformat}
2024-05-21T09:14:36,808  WARN [main] metastore.ObjectStore: Falling back to ORM 
path due to direct SQL failure (this is not an error): 
java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be 
cast to java.lang.String at 
org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
 at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateCDInBatch(DirectSqlUpdatePart.java:1228)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:888)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
 at 
org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
 at 
org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);{noformat}

*Third: Missing Boolean type check*

{noformat}
2024-05-21T09:35:44,063  WARN [main] metastore.ObjectStore: Falling back to ORM 
path due to direct SQL failure (this is not an error): 
java.sql.BatchUpdateException: A truncation error was encountered trying to 
shrink CHAR 'false' to length 1. at 
org.apache.hadoop.hive.metastore.ExceptionHandler.newMetaException(ExceptionHandler.java:152)
 at org.apache.hadoop.hive.metastore.Batchable.runBatched(Batchable.java:92) at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.lambda$updateSDInBatch$16(DirectSqlUpdatePart.java:926)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateWithStatement(DirectSqlUpdatePart.java:656)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateSDInBatch(DirectSqlUpdatePart.java:926)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.updateStorageDescriptorInBatch(DirectSqlUpdatePart.java:900)
 at 
org.apache.hadoop.hive.metastore.DirectSqlUpdatePart.alterPartitions(DirectSqlUpdatePart.java:638)
 at 
org.apache.hadoop.hive.metastore.MetaStoreDirectSql.alterPartitions(MetaStoreDirectSql.java:599)
 at 
org.apache.hadoop.hive.metastore.ObjectStore$20.getSqlResult(ObjectStore.java:5371);
{noformat}
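A defensive value conversion along these lines would address the two ClassCastExceptions and the Boolean truncation. This is an illustrative helper with hypothetical names, not the actual DirectSqlUpdatePart code, and the 'Y'/'N' encoding for the CHAR(1) column is an assumption:

```java
import java.sql.Clob;
import java.sql.SQLException;

// Sketch of the conversions the direct-SQL path needs: Derby hands CLOB
// columns back as java.sql.Clob, so a blind (String) cast throws the
// ClassCastException seen in the first two stack traces.
class DirectSqlValues {
  static String asString(Object dbValue) {
    if (dbValue instanceof Clob) {
      try {
        Clob clob = (Clob) dbValue;
        // Clob positions are 1-based per the java.sql.Clob contract.
        return clob.getSubString(1, (int) clob.length());
      } catch (SQLException e) {
        throw new IllegalStateException("Failed to read CLOB value", e);
      }
    }
    return dbValue == null ? null : dbValue.toString();
  }

  // The third failure tries to shrink the string "false" into a CHAR(1)
  // column; a boolean must be mapped to a single-character form first
  // ('Y'/'N' is an assumed encoding, the real schema may differ).
  static String asDbBoolean(boolean value) {
    return value ? "Y" : "N";
  }
}
```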






[jira] [Updated] (HIVE-28270) Fix missing partition paths bug on drop_database

2024-05-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28270:
--
Labels: pull-request-available  (was: )

> Fix missing partition paths  bug on drop_database
> -
>
> Key: HIVE-28270
> URL: https://issues.apache.org/jira/browse/HIVE-28270
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> In {{HMSHandler#drop_database_core}}, it needs to collect all partition 
> paths that are not in a subdirectory of the table path, but currently it 
> only fetches the last batch of paths.





[jira] [Commented] (HIVE-28269) Please have regular releases of hive and its docker image

2024-05-21 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848287#comment-17848287
 ] 

Ayush Saxena commented on HIVE-28269:
-

Just list the tickets fixing the security vulnerabilities that you want in the 
next release; we will have them in the 4.0.1 release planned for next month.

If you intend to fix anything that isn't already fixed, create a Jira for it, 
raise a PR, and put the hive-4.0.1-must label on it.

> Please have regular releases of hive and its docker image
> -
>
> Key: HIVE-28269
> URL: https://issues.apache.org/jira/browse/HIVE-28269
> Project: Hive
>  Issue Type: Task
>Reporter: Raviteja Lokineni
>Priority: Blocker
>
> Hi, we as a company are users of the Hive metastore and use the Docker 
> images. The latest Docker image, 4.0.0, has a lot of vulnerabilities. I see 
> most of them are patched in the mainline code, but a release has not been 
> made available.
> Can we/I help in any way to have regular releases, at the very least for the 
> security patches? If not us, then this is a request to the Hive maintainers 
> to have regular releases.





[jira] [Created] (HIVE-28270) Fix missing partition paths bug on drop_database

2024-05-21 Thread Wechar (Jira)
Wechar created HIVE-28270:
-

 Summary: Fix missing partition paths  bug on drop_database
 Key: HIVE-28270
 URL: https://issues.apache.org/jira/browse/HIVE-28270
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Wechar
Assignee: Wechar


In {{HMSHandler#drop_database_core}}, it needs to collect all partition paths 
that are not in a subdirectory of the table path, but currently it only 
fetches the last batch of paths.
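The pitfall can be illustrated with a minimal sketch (hypothetical shapes, not the actual HMSHandler code): paths outside the table directory must be accumulated across all batch iterations rather than reassigned per batch.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching bug: the result list is created once, outside the
// batch loop, and appended to for every batch. The buggy variant reassigns
// the list inside the loop, so only the last batch of paths survives.
class PartitionPathCollector {
  static List<String> collectExternalPaths(List<List<String>> partitionPathBatches,
                                           String tablePath) {
    List<String> external = new ArrayList<>(); // accumulate across ALL batches
    for (List<String> batch : partitionPathBatches) {
      for (String path : batch) {
        if (!path.startsWith(tablePath + "/")) {
          external.add(path); // append; never reassign `external` here
        }
      }
    }
    return external;
  }
}
```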





[jira] [Created] (HIVE-28269) Please have regular releases of hive and its docker image

2024-05-21 Thread Raviteja Lokineni (Jira)
Raviteja Lokineni created HIVE-28269:


 Summary: Please have regular releases of hive and its docker image
 Key: HIVE-28269
 URL: https://issues.apache.org/jira/browse/HIVE-28269
 Project: Hive
  Issue Type: Task
Reporter: Raviteja Lokineni


Hi, we as a company are users of the Hive metastore and use the Docker images. 
The latest Docker image, 4.0.0, has a lot of vulnerabilities. I see most of 
them are patched in the mainline code, but a release has not been made 
available.

Can we/I help in any way to have regular releases, at the very least for the 
security patches? If not us, then this is a request to the Hive maintainers to 
have regular releases.





[jira] [Commented] (HIVE-28239) Fix bug on HMSHandler#checkLimitNumberOfPartitions

2024-05-21 Thread Denys Kuzmenko (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848213#comment-17848213
 ] 

Denys Kuzmenko commented on HIVE-28239:
---

Merged to master.
Thanks for the patch, [~wechar]!

> Fix bug on HMSHandler#checkLimitNumberOfPartitions
> --
>
> Key: HIVE-28239
> URL: https://issues.apache.org/jira/browse/HIVE-28239
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
>
> {{HMSHandler#checkLimitNumberOfPartitions}} should not compare the request 
> size, which can cause an incorrect limit check.
> Assume that HMS configures {{metastore.limit.partition.request}} as 100, the 
> client calls {{get_partitions_by_filter}} with maxParts as 101, and the 
> number of matching partitions is 50; in this case the HMS server should not 
> throw a MetaException from the partition limit check.
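The corrected predicate can be sketched as follows (illustrative only, with a hypothetical method, not the actual HMSHandler signature): the configured server limit should apply to the number of partitions that actually match, not to the client's requested maxParts.

```java
// Sketch of the corrected limit check. A negative serverLimit means
// metastore.limit.partition.request is unset/disabled.
class PartitionLimitCheck {
  static boolean exceedsLimit(int matchingPartitions, int serverLimit) {
    // Buggy variant compared the request's maxParts against serverLimit,
    // rejecting maxParts=101 even when only 50 partitions actually match.
    return serverLimit >= 0 && matchingPartitions > serverLimit;
  }
}
```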





[jira] [Updated] (HIVE-28239) Fix bug on HMSHandler#checkLimitNumberOfPartitions

2024-05-21 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28239:
--
Affects Version/s: 4.0.0

> Fix bug on HMSHandler#checkLimitNumberOfPartitions
> --
>
> Key: HIVE-28239
> URL: https://issues.apache.org/jira/browse/HIVE-28239
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 4.0.0
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> {{HMSHandler#checkLimitNumberOfPartitions}} should not compare the request 
> size, which can cause an incorrect limit check.
> Assume that HMS configures {{metastore.limit.partition.request}} as 100, the 
> client calls {{get_partitions_by_filter}} with maxParts as 101, and the 
> number of matching partitions is 50; in this case the HMS server should not 
> throw a MetaException from the partition limit check.





[jira] [Resolved] (HIVE-28239) Fix bug on HMSHandler#checkLimitNumberOfPartitions

2024-05-21 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko resolved HIVE-28239.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

> Fix bug on HMSHandler#checkLimitNumberOfPartitions
> --
>
> Key: HIVE-28239
> URL: https://issues.apache.org/jira/browse/HIVE-28239
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Wechar
>Assignee: Wechar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> {{HMSHandler#checkLimitNumberOfPartitions}} should not compare the request 
> size, which can cause an incorrect limit check.
> Assume that HMS configures {{metastore.limit.partition.request}} as 100, the 
> client calls {{get_partitions_by_filter}} with maxParts as 101, and the 
> number of matching partitions is 50; in this case the HMS server should not 
> throw a MetaException from the partition limit check.





[jira] [Updated] (HIVE-26838) Adding support for a new event "Reload event" in the HMS

2024-05-21 Thread Manish Maheshwari (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Maheshwari updated HIVE-26838:
-
Summary: Adding support for a new event "Reload event" in the HMS  (was: 
Add a new event to improve cache performance in external systems that 
communicates with HMS.)

> Adding support for a new event "Reload event" in the HMS
> 
>
> Key: HIVE-26838
> URL: https://issues.apache.org/jira/browse/HIVE-26838
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Standalone Metastore
>Reporter: Sai Hemanth Gantasala
>Assignee: Sai Hemanth Gantasala
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-beta-1
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Adding support for a new event, "Reload event", in the HMS (HiveMetaStore). 
> This event can be used by external services that depend on HMS for metadata 
> operations to improve their cache performance. In a distributed environment 
> where replicas of an external service (each with its own cache) talk to HMS 
> for metadata operations, the reload event can be used to improve cache 
> performance and ensure consistency among all the replicas for a given 
> table/partition.





[jira] [Created] (HIVE-28268) Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false

2024-05-21 Thread Butao Zhang (Jira)
Butao Zhang created HIVE-28268:
--

 Summary: Iceberg: Retrieve row count from iceberg SnapshotSummary 
in case of iceberg.hive.keep.stats=false
 Key: HIVE-28268
 URL: https://issues.apache.org/jira/browse/HIVE-28268
 Project: Hive
  Issue Type: Task
  Components: Iceberg integration
Reporter: Butao Zhang








[jira] [Commented] (HIVE-28268) Iceberg: Retrieve row count from iceberg SnapshotSummary in case of iceberg.hive.keep.stats=false

2024-05-21 Thread Butao Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848151#comment-17848151
 ] 

Butao Zhang commented on HIVE-28268:


PR https://github.com/apache/hive/pull/5215

> Iceberg: Retrieve row count from iceberg SnapshotSummary in case of 
> iceberg.hive.keep.stats=false
> -
>
> Key: HIVE-28268
> URL: https://issues.apache.org/jira/browse/HIVE-28268
> Project: Hive
>  Issue Type: Task
>  Components: Iceberg integration
>Reporter: Butao Zhang
>Priority: Major
>






[jira] [Updated] (HIVE-23993) Handle irrecoverable errors

2024-05-21 Thread Smruti Biswal (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Smruti Biswal updated HIVE-23993:
-
Labels: pull-request-available  (was: pull pull-request-available)

> Handle irrecoverable errors
> ---
>
> Key: HIVE-23993
> URL: https://issues.apache.org/jira/browse/HIVE-23993
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23993.01.patch, HIVE-23993.02.patch, 
> HIVE-23993.03.patch, HIVE-23993.04.patch, HIVE-23993.05.patch, 
> HIVE-23993.06.patch, HIVE-23993.07.patch, HIVE-23993.08.patch, Retry Logic 
> for Replication.pdf
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>






[jira] [Updated] (HIVE-23993) Handle irrecoverable errors

2024-05-21 Thread Smruti Biswal (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Smruti Biswal updated HIVE-23993:
-
Labels: pull pull-request-available  (was: pull-request-available)

> Handle irrecoverable errors
> ---
>
> Key: HIVE-23993
> URL: https://issues.apache.org/jira/browse/HIVE-23993
> Project: Hive
>  Issue Type: Task
>Reporter: Aasha Medhi
>Assignee: Aasha Medhi
>Priority: Major
>  Labels: pull, pull-request-available
> Attachments: HIVE-23993.01.patch, HIVE-23993.02.patch, 
> HIVE-23993.03.patch, HIVE-23993.04.patch, HIVE-23993.05.patch, 
> HIVE-23993.06.patch, HIVE-23993.07.patch, HIVE-23993.08.patch, Retry Logic 
> for Replication.pdf
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (HIVE-25189) Cache the validWriteIdList in query cache before fetching tables from HMS

2024-05-20 Thread Denys Kuzmenko (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847952#comment-17847952
 ] 

Denys Kuzmenko commented on HIVE-25189:
---

[~scarlin], [~kkasa] any ideas if that could be leveraged with HIVE-28238 
(cache later when we have types) or could be reverted?

> Cache the validWriteIdList in query cache before fetching tables from HMS
> -
>
> Key: HIVE-25189
> URL: https://issues.apache.org/jira/browse/HIVE-25189
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Steve Carlin
>Assignee: Steve Carlin
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> For a small performance boost at compile time, we should fetch the 
> validWriteIdList before fetching the tables. HMS allows these to be batched 
> together in one call. This will prevent the getTable API from being called 
> twice, because currently the first time we call it, we pass in a null 
> validWriteIdList.





[jira] [Updated] (HIVE-26220) Shade & relocate dependencies in hive-exec to avoid conflicting with downstream projects

2024-05-19 Thread Zhihua Deng (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihua Deng updated HIVE-26220:
---
Labels: pull-request-available  (was: hive-4.0.1-must 
pull-request-available)

> Shade & relocate dependencies in hive-exec to avoid conflicting with 
> downstream projects
> 
>
> Key: HIVE-26220
> URL: https://issues.apache.org/jira/browse/HIVE-26220
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0, 4.0.0-alpha-1
>Reporter: Chao Sun
>Priority: Blocker
>  Labels: pull-request-available
>
> Currently projects like Spark, Trino/Presto, Iceberg, etc., depend on 
> {{hive-exec:core}}, which was removed in HIVE-25531. The reason these 
> projects use {{hive-exec:core}} is that they have the flexibility to 
> exclude, shade, and relocate dependencies in {{hive-exec}} that conflict 
> with the ones they bring in themselves. However, with {{hive-exec}} this is 
> no longer possible, since it is a fat jar that shades those dependencies but 
> does not relocate many of them.
> In order for downstream projects to consume {{hive-exec}}, we will need to 
> make sure all the dependencies in {{hive-exec}} are properly shaded and 
> relocated, so they won't cause conflicts with those from downstream.





[jira] [Updated] (HIVE-28267) Support merge task functionality for Iceberg delete files

2024-05-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28267:
--
Labels: pull-request-available  (was: )

> Support merge task functionality for Iceberg delete files
> -
>
> Key: HIVE-28267
> URL: https://issues.apache.org/jira/browse/HIVE-28267
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
>
> Support merge task functionality for Iceberg delete files.





[jira] [Created] (HIVE-28267) Support merge task functionality for Iceberg delete files

2024-05-17 Thread Sourabh Badhya (Jira)
Sourabh Badhya created HIVE-28267:
-

 Summary: Support merge task functionality for Iceberg delete files
 Key: HIVE-28267
 URL: https://issues.apache.org/jira/browse/HIVE-28267
 Project: Hive
  Issue Type: Improvement
Reporter: Sourabh Badhya
Assignee: Sourabh Badhya


Support merge task functionality for Iceberg delete files.





[jira] [Updated] (HIVE-28266) Iceberg: select count(*) from data_files metadata tables gives wrong result

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28266:
--
Labels: pull-request-available  (was: )

> Iceberg: select count(*) from data_files metadata tables gives wrong result
> ---
>
> Key: HIVE-28266
> URL: https://issues.apache.org/jira/browse/HIVE-28266
> Project: Hive
>  Issue Type: Bug
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>  Labels: pull-request-available
>
> In Hive Iceberg, every table has a corresponding metadata table 
> "*.data_files" that contains info about the files that hold the table's data.
> select count(*) from a data_files metadata table returns the number of rows 
> in the data table instead of the number of data files in the metadata table.
>  
> {code:java}
> CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
> iceberg stored as orc TBLPROPERTIES 
> ('external.table.purge'='true','format-version'='2');
> insert into x values 
> ('amy', 35, 123412344),
> ('adxfvy', 36, 123412534),
> ('amsdfyy', 37, 123417234),
> ('asafmy', 38, 123412534);
> insert into x values 
> ('amerqwy', 39, 123441234),
> ('amyxzcv', 40, 123341234),
> ('erweramy', 45, 122341234);
> Select * from default.x.data_files;
> -- Returns 2 records in the output
> Select count(*) from default.x.data_files;
> -- Returns 7 instead of 2
> {code}
>  





[jira] [Updated] (HIVE-28266) Iceberg: select count(*) from data_files metadata tables gives wrong result

2024-05-16 Thread Dmitriy Fingerman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Fingerman updated HIVE-28266:
-
Summary: Iceberg: select count(*) from data_files metadata tables gives 
wrong result  (was: Iceberg: select count(*) from *.data_files metadata tables 
gives wrong result)

> Iceberg: select count(*) from data_files metadata tables gives wrong result
> ---
>
> Key: HIVE-28266
> URL: https://issues.apache.org/jira/browse/HIVE-28266
> Project: Hive
>  Issue Type: Bug
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>
> In Hive Iceberg, every table has a corresponding metadata table 
> "*.data_files" that contains info about the files that hold the table's data.
> select count(*) from a data_files metadata table returns the number of rows 
> in the data table instead of the number of data files in the metadata table.
>  
> {code:java}
> CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
> iceberg stored as orc TBLPROPERTIES 
> ('external.table.purge'='true','format-version'='2');
> insert into x values 
> ('amy', 35, 123412344),
> ('adxfvy', 36, 123412534),
> ('amsdfyy', 37, 123417234),
> ('asafmy', 38, 123412534);
> insert into x values 
> ('amerqwy', 39, 123441234),
> ('amyxzcv', 40, 123341234),
> ('erweramy', 45, 122341234);
> Select * from default.x.data_files;
> -- Returns 2 records in the output
> Select count(*) from default.x.data_files;
> -- Returns 7 instead of 2
> {code}
>  





[jira] [Updated] (HIVE-28266) Iceberg: select count(*) from *.data_files metadata tables gives wrong result

2024-05-16 Thread Dmitriy Fingerman (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Fingerman updated HIVE-28266:
-
Description: 
In Hive Iceberg, every table has a corresponding metadata table "*.data_files" 
that contains info about the files that hold the table's data.

select count(*) from a data_files metadata table returns the number of rows in 
the data table instead of the number of data files in the metadata table.

 
{code:java}
CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
iceberg stored as orc TBLPROPERTIES 
('external.table.purge'='true','format-version'='2');
insert into x values 
('amy', 35, 123412344),
('adxfvy', 36, 123412534),
('amsdfyy', 37, 123417234),
('asafmy', 38, 123412534);
insert into x values 
('amerqwy', 39, 123441234),
('amyxzcv', 40, 123341234),
('erweramy', 45, 122341234);
Select * from default.x.data_files;
-- Returns 2 records in the output
Select count(*) from default.x.data_files;
-- Returns 7 instead of 2
{code}
 

  was:
In Hive Iceberg, every table has a corresponding metadata table "*.data_files" 
that contains info about the files that contain table's data.

select count(*) from a data_file metadata table returns number of rows in the 
data table instead of number of data files from the metadata table.


CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
iceberg stored as orc TBLPROPERTIES 
('external.table.purge'='true','format-version'='2');

insert into x values 
('amy', 35, 123412344),
('adxfvy', 36, 123412534),
('amsdfyy', 37, 123417234),
('asafmy', 38, 123412534);

insert into x values 
('amerqwy', 39, 123441234),
('amyxzcv', 40, 123341234),
('erweramy', 45, 122341234);

Select * from default.x.data_files;
-- Returns 2 records in the output

Select count(*) from default.x.data_files;
-- Returns 7 instead of 2


> Iceberg: select count(*) from *.data_files metadata tables gives wrong result
> -
>
> Key: HIVE-28266
> URL: https://issues.apache.org/jira/browse/HIVE-28266
> Project: Hive
>  Issue Type: Bug
>Reporter: Dmitriy Fingerman
>Assignee: Dmitriy Fingerman
>Priority: Major
>
> In Hive Iceberg, every table has a corresponding metadata table 
> "*.data_files" that contains info about the files that contain table's data.
> select count(*) from a data_file metadata table returns number of rows in the 
> data table instead of number of data files from the metadata table.
>  
> {code:java}
> CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
> iceberg stored as orc TBLPROPERTIES 
> ('external.table.purge'='true','format-version'='2');
> insert into x values 
> ('amy', 35, 123412344),
> ('adxfvy', 36, 123412534),
> ('amsdfyy', 37, 123417234),
> ('asafmy', 38, 123412534);
> insert into x values 
> ('amerqwy', 39, 123441234),
> ('amyxzcv', 40, 123341234),
> ('erweramy', 45, 122341234);
> Select * from default.x.data_files;
> -- Returns 2 records in the output
> Select count(*) from default.x.data_files;
> -- Returns 7 instead of 2
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28266) Iceberg: select count(*) from *.data_files metadata tables gives wrong result

2024-05-16 Thread Dmitriy Fingerman (Jira)
Dmitriy Fingerman created HIVE-28266:


 Summary: Iceberg: select count(*) from *.data_files metadata 
tables gives wrong result
 Key: HIVE-28266
 URL: https://issues.apache.org/jira/browse/HIVE-28266
 Project: Hive
  Issue Type: Bug
Reporter: Dmitriy Fingerman
Assignee: Dmitriy Fingerman


In Hive Iceberg, every table has a corresponding metadata table "*.data_files" 
that contains info about the files that contain table's data.

select count(*) from a data_file metadata table returns number of rows in the 
data table instead of number of data files from the metadata table.


CREATE TABLE x (name VARCHAR(50), age TINYINT, num_clicks BIGINT) stored by 
iceberg stored as orc TBLPROPERTIES 
('external.table.purge'='true','format-version'='2');

insert into x values 
('amy', 35, 123412344),
('adxfvy', 36, 123412534),
('amsdfyy', 37, 123417234),
('asafmy', 38, 123412534);

insert into x values 
('amerqwy', 39, 123441234),
('amyxzcv', 40, 123341234),
('erweramy', 45, 122341234);

Select * from default.x.data_files;
-- Returns 2 records in the output

Select count(*) from default.x.data_files;
-- Returns 7 instead of 2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28264) OOM/slow compilation when query contains SELECT clauses with nested expressions

2024-05-16 Thread Alessandro Solimando (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847000#comment-17847000
 ] 

Alessandro Solimando commented on HIVE-28264:
-

I guess the problem also applies to the respective Calcite rules from which the 
Hive ones were derived. Do you know if it has been addressed there?

> OOM/slow compilation when query contains SELECT clauses with nested 
> expressions
> ---
>
> Key: HIVE-28264
> URL: https://issues.apache.org/jira/browse/HIVE-28264
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, HiveServer2
>Affects Versions: 4.0.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>
> {code:sql}
> CREATE TABLE t0 (`title` string);
> SELECT x10 from
> (SELECT concat_ws('L10',x9, x9, x9, x9) as x10 from
> (SELECT concat_ws('L9',x8, x8, x8, x8) as x9 from
> (SELECT concat_ws('L8',x7, x7, x7, x7) as x8 from
> (SELECT concat_ws('L7',x6, x6, x6, x6) as x7 from
> (SELECT concat_ws('L6',x5, x5, x5, x5) as x6 from
> (SELECT concat_ws('L5',x4, x4, x4, x4) as x5 from
> (SELECT concat_ws('L4',x3, x3, x3, x3) as x4 from
> (SELECT concat_ws('L3',x2, x2, x2, x2) as x3 
> from
> (SELECT concat_ws('L2',x1, x1, x1, x1) as 
> x2 from
> (SELECT concat_ws('L1',x0, x0, x0, 
> x0) as x1 from
> (SELECT concat_ws('L0',title, 
> title, title, title) as x0 from t0) t1) t2) t3) t4) t5) t6) t7) t8) t9) t10) t
> WHERE x10 = 'Something';
> {code}
> The query above fails with OOM when run with the TestMiniLlapLocalCliDriver 
> and the default max heap size configuration effective for tests (-Xmx2048m).
> {noformat}
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
>   at java.lang.StringBuilder.append(StringBuilder.java:136)
>   at org.apache.calcite.rex.RexCall.computeDigest(RexCall.java:152)
>   at org.apache.calcite.rex.RexCall.toString(RexCall.java:165)
>   at org.apache.calcite.rex.RexCall.appendOperands(RexCall.java:105)
>   at org.apache.calcite.rex.RexCall.computeDigest(RexCall.java:151)
>   at org.apache.calcite.rex.RexCall.toString(RexCall.java:165)
>   at java.lang.String.valueOf(String.java:2994)
>   at java.lang.StringBuilder.append(StringBuilder.java:131)
>   at 
> org.apache.calcite.rel.externalize.RelWriterImpl.explain_(RelWriterImpl.java:90)
>   at 
> org.apache.calcite.rel.externalize.RelWriterImpl.done(RelWriterImpl.java:144)
>   at 
> org.apache.calcite.rel.AbstractRelNode.explain(AbstractRelNode.java:246)
>   at 
> org.apache.calcite.rel.externalize.RelWriterImpl.explainInputs(RelWriterImpl.java:122)
>   at 
> org.apache.calcite.rel.externalize.RelWriterImpl.explain_(RelWriterImpl.java:116)
>   at 
> org.apache.calcite.rel.externalize.RelWriterImpl.done(RelWriterImpl.java:144)
>   at 
> org.apache.calcite.rel.AbstractRelNode.explain(AbstractRelNode.java:246)
>   at org.apache.calcite.plan.RelOptUtil.toString(RelOptUtil.java:2308)
>   at org.apache.calcite.plan.RelOptUtil.toString(RelOptUtil.java:2292)
>   at 
> org.apache.hadoop.hive.ql.optimizer.calcite.RuleEventLogger.ruleProductionSucceeded(RuleEventLogger.java:73)
>   at 
> org.apache.calcite.plan.MulticastRelOptListener.ruleProductionSucceeded(MulticastRelOptListener.java:68)
>   at 
> org.apache.calcite.plan.AbstractRelOptPlanner.notifyTransformation(AbstractRelOptPlanner.java:370)
>   at 
> org.apache.calcite.plan.hep.HepPlanner.applyTransformationResults(HepPlanner.java:702)
>   at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:545)
>   at 
> org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:407)
>   at 
> org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:271)
>   at 
> org.apache.calcite.plan.hep.HepInstruction$RuleCollection.execute(HepInstruction.java:74)
>   at 
> org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:202)
>   at 
> org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:189)

[jira] [Commented] (HIVE-28264) OOM/slow compilation when query contains SELECT clauses with nested expressions

2024-05-16 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846978#comment-17846978
 ] 

Stamatis Zampetakis commented on HIVE-28264:


To understand the problem let's consider a much simpler variation of the query 
in the description.

{code:sql}
SELECT x1 from
(SELECT concat_ws('L1',x0, x0) as x1 from
(SELECT concat_ws('L0',title, title) as x0 from t0) t1) t2;
{code}

It is easy to see that the SELECT clauses can be merged together leading to the 
following query.
{code:sql}
SELECT concat_ws('L1',concat_ws('L0',title, title), concat_ws('L0',title, 
title)) as x1 from t0;
{code}
The two queries are equivalent; however, they don't contain the same number of 
{{concat_ws}} calls. The first contains two calls while the second contains 
three, and the merged expression in its SELECT clause is bigger than both of 
the original expressions.

When the query has nested function calls (CONCAT or anything else), merging 
them together leads to bigger expressions. In fact, the size of the merged 
expression grows exponentially with the nesting depth, at a rate determined by 
the number of arguments.

+Examples:+
When the CONCAT function has two arguments, each nesting level grows the 
expression by a factor of two. The size of the final expression is 
(1-2^L)/(1-2), where L is the number of nesting levels.
When the CONCAT function has four arguments (as in the query in the 
description), each nesting level grows the expression by a factor of four. 
The size of the final expression is (1-4^L)/(1-4), where L is the number of 
nesting levels.
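The geometric growth above can be checked with a few lines of Python (a 
standalone sketch; {{merged_call_count}} is an illustrative helper, not Hive 
code):

```python
def merged_call_count(k: int, levels: int) -> int:
    # After merging `levels` nested calls whose arguments each repeat the
    # previous level's expression k times, the total call count is the
    # geometric series 1 + k + k^2 + ... + k^(levels-1) = (k^levels - 1)/(k - 1).
    return (k ** levels - 1) // (k - 1)

# Simpler variation: two levels of concat_ws with two arguments each
# merge into an expression containing three calls.
assert merged_call_count(2, 2) == 3

# Query in the description: 11 levels (L0..L10), four arguments each.
print(merged_call_count(4, 11))  # 1398101 calls (~1.4 million)
```

This makes it clear why an uncontrolled merge of eleven nesting levels blows 
past a 2 GB heap during planning.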

There are various optimization rules (e.g., HiveFieldTrimmerRule, 
HiveProjectMergeRule, etc.) that try to merge expressions together, and when 
this happens in an uncontrolled manner the resulting expression is 
exponentially big, which can further lead to OOM problems, very slow 
compilation, etc. Clearly it is not always beneficial to merge expressions 
together, and the aforementioned rules do have some logic in place to avoid 
this kind of huge expansion. Both rules pass through 
{{RelOptUtil#pushPastProjectUnlessBloat}} so they can be tuned via the bloat 
parameter.

However, there are also other rules that are affected by this exponential 
growth problem, such as {{HiveFilterProjectTransposeRule}}, and currently they 
do not have logic to prevent it.
+Before+
{code:sql}
SELECT x1 from
(SELECT concat_ws('L1',x0, x0) as x1 from
(SELECT concat_ws('L0',title, title) as x0 from t0) t1) t2
WHERE x1 = 'Something';
{code}
+After+
{code:sql}
SELECT x1 from
(SELECT concat_ws('L1',x0, x0) as x1 from
(SELECT concat_ws('L0',title, title) as x0 from t0
 WHERE concat_ws('L1',concat_ws('L0',title, title), 
concat_ws('L0',title, title)) = 'Something') t1) t2;
{code}
In this case the exponential growth happens when trying to push the filter down 
past the projections. A possible solution would be to improve 
HiveFilterProjectTransposeRule and other rules that may be affected to avoid 
creating overly complex expressions using a similar bloat configuration 
parameter.

> OOM/slow compilation when query contains SELECT clauses with nested 
> expressions
> ---
>
> Key: HIVE-28264
> URL: https://issues.apache.org/jira/browse/HIVE-28264
> Project: Hive
>  Issue Type: Bug
>  Components: CBO, HiveServer2
>Affects Versions: 4.0.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>
> {code:sql}
> CREATE TABLE t0 (`title` string);
> SELECT x10 from
> (SELECT concat_ws('L10',x9, x9, x9, x9) as x10 from
> (SELECT concat_ws('L9',x8, x8, x8, x8) as x9 from
> (SELECT concat_ws('L8',x7, x7, x7, x7) as x8 from
> (SELECT concat_ws('L7',x6, x6, x6, x6) as x7 from
> (SELECT concat_ws('L6',x5, x5, x5, x5) as x6 from
> (SELECT concat_ws('L5',x4, x4, x4, x4) as x5 from
> (SELECT concat_ws('L4',x3, x3, x3, x3) as x4 from
> (SELECT concat_ws('L3',x2, x2, x2, x2) as x3 
> from
> (SELECT concat_ws('L2',x1, x1, x1, x1) as 
> x2 from
> (SELECT concat_ws('L1',x0, x0, x0, 
> x0) as x1 from
> (SELECT concat_ws('L0',title, 
> title, title, title) as x0 from t0) t1) t2) t3) t4) t5) t6) t7) t8) t9) t10) t
> WHERE x10 = 'Something';
> {code}
> The query above fails with OOM when run with the TestMiniLlapLocalCliDriver 
> and the default max heap size configuration effective for tests (-Xmx2048m).

[jira] [Updated] (HIVE-28254) CBO (Calcite Return Path): Multiple DISTINCT leads to wrong results

2024-05-16 Thread Shohei Okumiya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shohei Okumiya updated HIVE-28254:
--
Status: Patch Available  (was: Open)

> CBO (Calcite Return Path): Multiple DISTINCT leads to wrong results
> ---
>
> Key: HIVE-28254
> URL: https://issues.apache.org/jira/browse/HIVE-28254
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0
>Reporter: Shohei Okumiya
>Assignee: Shohei Okumiya
>Priority: Major
>  Labels: hive-4.0.1-must, pull-request-available
>
> CBO return path can build incorrect GroupByOperator when multiple 
> aggregations with DISTINCT are involved.
> This is an example.
> {code:java}
> CREATE TABLE test (col1 INT, col2 INT);
> INSERT INTO test VALUES (1, 100), (2, 200), (2, 200), (3, 300);
> set hive.cbo.returnpath.hiveop=true;
> set hive.map.aggr=false;
> SELECT
>   SUM(DISTINCT col1),
>   COUNT(DISTINCT col1),
>   SUM(DISTINCT col2),
>   SUM(col2)
> FROM test;{code}
> The last column should be 800. But the SUM refers to col1 and the actual 
> result is 8.
> {code:java}
> +--+--+--+--+
> | _c0  | _c1  | _c2  | _c3  |
> +--+--+--+--+
> | 6    | 3    | 600  | 8    |
> +--+--+--+--+ {code}
>  
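The expected values for this reproducer can be verified independently of Hive 
with plain Python over the inserted rows:

```python
# Rows inserted into `test` in the reproducer above: (col1, col2).
rows = [(1, 100), (2, 200), (2, 200), (3, 300)]
col1 = [c1 for c1, _ in rows]
col2 = [c2 for _, c2 in rows]

# SUM(DISTINCT col1), COUNT(DISTINCT col1), SUM(DISTINCT col2), SUM(col2)
expected = (sum(set(col1)), len(set(col1)), sum(set(col2)), sum(col2))
assert expected == (6, 3, 600, 800)  # the buggy plan yields 8 for SUM(col2)
```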





[jira] [Created] (HIVE-28265) Improve the error message for hive.query.timeout.seconds

2024-05-16 Thread Shohei Okumiya (Jira)
Shohei Okumiya created HIVE-28265:
-

 Summary: Improve the error message for hive.query.timeout.seconds
 Key: HIVE-28265
 URL: https://issues.apache.org/jira/browse/HIVE-28265
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 4.0.0
Reporter: Shohei Okumiya
Assignee: Shohei Okumiya


`hive.query.timeout.seconds` seems to work correctly, but the error message 
always says the query timed out in 0 seconds.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> set 
hive.query.timeout.seconds=1s;
No rows affected (0.111 seconds)
0: jdbc:hive2://hive-hiveserver2:1/defaul> select count(*) from test;
...
Error: Query timed out after 0 seconds (state=,code=0){code}





[jira] [Created] (HIVE-28264) OOM/slow compilation when query contains SELECT clauses with nested expressions

2024-05-16 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-28264:
--

 Summary: OOM/slow compilation when query contains SELECT clauses 
with nested expressions
 Key: HIVE-28264
 URL: https://issues.apache.org/jira/browse/HIVE-28264
 Project: Hive
  Issue Type: Bug
  Components: CBO, HiveServer2
Affects Versions: 4.0.0
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


{code:sql}
CREATE TABLE t0 (`title` string);
SELECT x10 from
(SELECT concat_ws('L10',x9, x9, x9, x9) as x10 from
(SELECT concat_ws('L9',x8, x8, x8, x8) as x9 from
(SELECT concat_ws('L8',x7, x7, x7, x7) as x8 from
(SELECT concat_ws('L7',x6, x6, x6, x6) as x7 from
(SELECT concat_ws('L6',x5, x5, x5, x5) as x6 from
(SELECT concat_ws('L5',x4, x4, x4, x4) as x5 from
(SELECT concat_ws('L4',x3, x3, x3, x3) as x4 from
(SELECT concat_ws('L3',x2, x2, x2, x2) as x3 
from
(SELECT concat_ws('L2',x1, x1, x1, x1) as 
x2 from
(SELECT concat_ws('L1',x0, x0, x0, x0) 
as x1 from
(SELECT concat_ws('L0',title, 
title, title, title) as x0 from t0) t1) t2) t3) t4) t5) t6) t7) t8) t9) t10) t
WHERE x10 = 'Something';
{code}
The query above fails with OOM when run with the TestMiniLlapLocalCliDriver and 
the default max heap size configuration effective for tests (-Xmx2048m).

{noformat}
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at 
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at org.apache.calcite.rex.RexCall.computeDigest(RexCall.java:152)
at org.apache.calcite.rex.RexCall.toString(RexCall.java:165)
at org.apache.calcite.rex.RexCall.appendOperands(RexCall.java:105)
at org.apache.calcite.rex.RexCall.computeDigest(RexCall.java:151)
at org.apache.calcite.rex.RexCall.toString(RexCall.java:165)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at 
org.apache.calcite.rel.externalize.RelWriterImpl.explain_(RelWriterImpl.java:90)
at 
org.apache.calcite.rel.externalize.RelWriterImpl.done(RelWriterImpl.java:144)
at 
org.apache.calcite.rel.AbstractRelNode.explain(AbstractRelNode.java:246)
at 
org.apache.calcite.rel.externalize.RelWriterImpl.explainInputs(RelWriterImpl.java:122)
at 
org.apache.calcite.rel.externalize.RelWriterImpl.explain_(RelWriterImpl.java:116)
at 
org.apache.calcite.rel.externalize.RelWriterImpl.done(RelWriterImpl.java:144)
at 
org.apache.calcite.rel.AbstractRelNode.explain(AbstractRelNode.java:246)
at org.apache.calcite.plan.RelOptUtil.toString(RelOptUtil.java:2308)
at org.apache.calcite.plan.RelOptUtil.toString(RelOptUtil.java:2292)
at 
org.apache.hadoop.hive.ql.optimizer.calcite.RuleEventLogger.ruleProductionSucceeded(RuleEventLogger.java:73)
at 
org.apache.calcite.plan.MulticastRelOptListener.ruleProductionSucceeded(MulticastRelOptListener.java:68)
at 
org.apache.calcite.plan.AbstractRelOptPlanner.notifyTransformation(AbstractRelOptPlanner.java:370)
at 
org.apache.calcite.plan.hep.HepPlanner.applyTransformationResults(HepPlanner.java:702)
at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:545)
at 
org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:407)
at 
org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:271)
at 
org.apache.calcite.plan.hep.HepInstruction$RuleCollection.execute(HepInstruction.java:74)
at 
org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:202)
at 
org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:189)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.executeProgram(CalcitePlanner.java:2452)
at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.executeProgram(CalcitePlanner.java:2411)

{noformat}






[jira] [Created] (HIVE-28263) Metastore scripts : Update query getting stuck when sub-query of in-clause is returning empty results

2024-05-16 Thread Taraka Rama Rao Lethavadla (Jira)
Taraka Rama Rao Lethavadla created HIVE-28263:
-

 Summary: Metastore scripts : Update query getting stuck when 
sub-query of in-clause is returning empty results
 Key: HIVE-28263
 URL: https://issues.apache.org/jira/browse/HIVE-28263
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: Taraka Rama Rao Lethavadla


As part of fix HIVE-27457

below query is added to 
[upgrade-4.0.0-alpha-2-to-4.0.0-beta-1.mysql.sql|https://github.com/apache/hive/blob/0e84fe2000c026afd0a49f4e7c7dd5f54fe7b1ec/standalone-metastore/metastore-server/src/main/sql/mysql/upgrade-4.0.0-alpha-2-to-4.0.0-beta-1.mysql.sql#L43]
{noformat}
UPDATE SERDES
SET SERDES.SLIB = "org.apache.hadoop.hive.kudu.KuduSerDe"
WHERE SERDE_ID IN (
SELECT SDS.SERDE_ID
FROM TBLS
INNER JOIN SDS ON TBLS.SD_ID = SDS.SD_ID
WHERE TBLS.TBL_ID IN (SELECT TBL_ID FROM TABLE_PARAMS WHERE PARAM_VALUE LIKE 
'%KuduStorageHandler%')
);{noformat}
This query hangs in MySQL when the sub-query returns empty results
 

 
{noformat}
MariaDB [test]> SELECT TBL_ID FROM table_params WHERE PARAM_VALUE LIKE 
'%KuduStorageHandler%';
Empty set (0.33 sec)
MariaDB [test]> SELECT sds.SERDE_ID FROM tbls LEFT JOIN sds ON tbls.SD_ID = 
sds.SD_ID WHERE tbls.TBL_ID IN (SELECT TBL_ID FROM table_params WHERE 
PARAM_VALUE LIKE '%KuduStorageHandler%');
Empty set (0.44 sec)
{noformat}
And the query kept on running for more than 20 minutes
{noformat}
MariaDB [test]> UPDATE serdes SET serdes.SLIB = 
"org.apache.hadoop.hive.kudu.KuduSerDe" WHERE SERDE_ID IN ( SELECT sds.SERDE_ID 
FROM tbls LEFT JOIN sds ON tbls.SD_ID = sds.SD_ID WHERE tbls.TBL_ID IN (SELECT 
TBL_ID FROM table_params WHERE PARAM_VALUE LIKE '%KuduStorageHandler%'));
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted{noformat}
The explain extended output looks like:
{noformat}
MariaDB [test]> explain extended UPDATE serdes SET serdes.SLIB = 
"org.apache.hadoop.hive.kudu.KuduSerDe" WHERE SERDE_ID IN ( SELECT sds.SERDE_ID 
FROM tbls LEFT JOIN sds ON tbls.SD_ID = sds.SD_ID WHERE tbls.TBL_ID IN (SELECT 
TBL_ID FROM table_params WHERE PARAM_VALUE LIKE '%KuduStorageHandler%'));
+--++--++---+--+-+-++--+-+
| id   | select_type        | table        | type   | possible_keys             
| key          | key_len | ref             | rows   | filtered | Extra       |
+--++--++---+--+-+-++--+-+
|    1 | PRIMARY            | serdes       | index  | NULL                      
| PRIMARY      | 8       | NULL            | 401267 |   100.00 | Using where |
|    2 | DEPENDENT SUBQUERY | tbls         | index  | PRIMARY,TBLS_N50,TBLS_N49 
| TBLS_N50     | 9       | NULL            |  50921 |   100.00 | Using index |
|    2 | DEPENDENT SUBQUERY |   | eq_ref | distinct_key              
| distinct_key | 8       | func            |      1 |   100.00 |             |
|    2 | DEPENDENT SUBQUERY | sds          | eq_ref | PRIMARY                   
| PRIMARY      | 8       | test.tbls.SD_ID |      1 |   100.00 | Using where |
|    3 | MATERIALIZED       | table_params | ALL    | PRIMARY,TABLE_PARAMS_N49  
| NULL         | NULL    | NULL            | 356593 |   100.00 | Using where |
+--++--++---+--+-+-++--+-+
5 rows in set (0.00 sec){noformat}





[jira] [Updated] (HIVE-28262) Single column use MultiDelimitSerDe parse column error

2024-05-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28262:
--
Labels: HiveServer2 pull-request-available  (was: HiveServer2)

> Single column use MultiDelimitSerDe parse column error
> --
>
> Key: HIVE-28262
> URL: https://issues.apache.org/jira/browse/HIVE-28262
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.1.3, 4.1.0
> Environment: Hive version: 3.1.3
>Reporter: Liu Weizheng
>Assignee: Liu Weizheng
>Priority: Major
>  Labels: HiveServer2, pull-request-available
> Fix For: 4.1.0
>
> Attachments: CleanShot 2024-05-16 at 15.13...@2x.png, CleanShot 
> 2024-05-16 at 15.17...@2x.png
>
>
> ENV:
> Hive: 3.1.3/4.1.0
> HDFS: 3.3.1
> --
> Create a text file for external table load,(e.g:/tmp/data):
>  
> {code:java}
> 1|@|
> 2|@|
> 3|@| {code}
>  
>  
> Create external table:
>  
> {code:java}
> CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp(`ID` string) ROW FORMAT 
> SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH 
> SERDEPROPERTIES('field.delim'='|@|') STORED AS textfile location 
> '/tmp/test_split_tmp'; {code}
>  
> put text file to external table path:
>  
> {code:java}
> hdfs dfs -put /tmp/data /tmp/test_split_tmp {code}
>  
>  
> query this table and cast column id to long type:
>  
> {code:java}
> select UDFToLong(`id`) from test_split_tmp; {code}
> *Why use the UDFToLong function? Because it returns NULL in this condition, 
> but applying it to the string '1' should return the long value 1.*
> {code:java}
> ++
> | id     |
> ++
> | NULL   |
> | NULL   |
> | NULL   |
> ++ {code}
> Therefore, I speculate that there is an issue with the field splitting in 
> MultiDelimitSerde.
> when I debug this issue, I found some problem below:
>  * org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes
>            *when fields.length=1 can't find the delimit index*
>  
> {code:java}
> private int[] findIndexes(byte[] array, byte[] target) {
>   if (fields.length <= 1) {  // bug
> return new int[0];
>   }
>   ...
>   for (int i = 1; i < indexes.length; i++) {  // bug
> array = Arrays.copyOfRange(array, indexInNewArray + target.length, 
> array.length);
> indexInNewArray = Bytes.indexOf(array, target);
> if (indexInNewArray == -1) {
>   break;
> }
> indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
>   }
>   return indexes;
> }{code}
>  
>  * org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit
>            *when fields.length=1 can't find the column startPosition*
>  
> {code:java}
> public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
>   ...
>   int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
>   ...
> if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug
>   int start = delimitIndexes[i - 1] + fieldDelimit.length;
>   startPosition[i] = start - i * diff;
> } else {
>   startPosition[i] = length + 1;
> }
>   }
>   Arrays.fill(fieldInited, false);
>   parsed = true;
> }{code}
>  
>  
> Multi delimit Process:
> *Actual:*  1|@| -> 1^A  id column start 0 ,next column start 1
> *Expected:*  1|@| -> 1^A  id column start 0 ,next column start 2
>  
> Fix:
>  # fields.length=1 should  find multi delimit index
>  # fields.length=1 should  calculate column start position correct
>  





[jira] [Resolved] (HIVE-28252) AssertionError when using HiveTableScan with a HepPlanner cluster

2024-05-16 Thread Stamatis Zampetakis (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis resolved HIVE-28252.

Fix Version/s: 4.1.0
   Resolution: Fixed

Fixed in 
https://github.com/apache/hive/commit/0e84fe2000c026afd0a49f4e7c7dd5f54fe7b1ec. 
Thanks for the review [~kkasa]!


> AssertionError when using HiveTableScan with a HepPlanner cluster
> -
>
> Key: HIVE-28252
> URL: https://issues.apache.org/jira/browse/HIVE-28252
> Project: Hive
>  Issue Type: Improvement
>  Components: CBO, Tests
>Affects Versions: 4.0.0
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> The {{HiveTableScan}} operator throws an 
> [AssertionError|https://github.com/apache/hive/blob/7950967eae9640fcc0aa22f4b6c7906b34281eac/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/reloperators/HiveTableScan.java#L153]
>  if the operator does not have the {{HiveRelNode.CONVENTION}} set.
> The {{HepPlanner}} does not use any 
> [RelTraitDef|https://github.com/apache/calcite/blob/f854ef5ee480e0ff893b18d27ec67dc381ee2244/core/src/main/java/org/apache/calcite/plan/AbstractRelOptPlanner.java#L276]
>  so the default [empty traitset for the respective 
> cluster|https://github.com/apache/calcite/blob/f854ef5ee480e0ff893b18d27ec67dc381ee2244/core/src/main/java/org/apache/calcite/plan/RelOptCluster.java#L99]
>  is always going to be empty.
> In principle we should not be able to use the {{HiveTableScan}} operator with 
> {{HepPlanner}}. However, the optimizer heavily uses the {{HepPlanner}} (in 
> fact more than the {{VolcanoPlanner}}) and it is reasonable to wonder how 
> this is possible given that this assertion is in place. The assertion is
> circumvented by creating a cluster from a 
> [VolcanoPlanner|https://github.com/apache/hive/blob/7950967eae9640fcc0aa22f4b6c7906b34281eac/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L1620]
>  and then using it in the 
> [HepPlanner|https://github.com/apache/hive/blob/7950967eae9640fcc0aa22f4b6c7906b34281eac/ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java#L2422].
>  
> This cluster usage is a bit contrived but does not necessarily need to change 
> at this stage.
> Nevertheless, since the {{HiveTableScan}} operator is suitable to run with 
> the {{HepPlanner}} the assertion can be relaxed (or removed altogether) to 
> better reflect the actual usage of the operator, and allow passing a "true" 
> HepPlanner cluster inside the operator.





[jira] [Created] (HIVE-28262) Single column use MultiDelimitSerDe parse column error

2024-05-16 Thread Liu Weizheng (Jira)
Liu Weizheng created HIVE-28262:
---

 Summary: Single column use MultiDelimitSerDe parse column error
 Key: HIVE-28262
 URL: https://issues.apache.org/jira/browse/HIVE-28262
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Affects Versions: 3.1.3, 4.1.0
 Environment: Hive version: 3.1.3
Reporter: Liu Weizheng
Assignee: Liu Weizheng
 Fix For: 4.1.0
 Attachments: CleanShot 2024-05-16 at 15.13...@2x.png, CleanShot 
2024-05-16 at 15.17...@2x.png

ENV:

Hive: 3.1.3/4.1.0

HDFS: 3.3.1

--

Create a text file for the external table to load (e.g., /tmp/data):

 
{code:java}
1|@|
2|@|
3|@| {code}
 

 

Create external table:

 
{code:java}
CREATE EXTERNAL TABLE IF NOT EXISTS test_split_tmp(`ID` string) ROW FORMAT 
SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH 
SERDEPROPERTIES('field.delim'='|@|') STORED AS textfile location 
'/tmp/test_split_tmp'; {code}
 

Put the text file into the external table path:

 
{code:java}
hdfs dfs -put /tmp/data /tmp/test_split_tmp {code}
 

 

Query this table and cast column id to long:

 
{code:java}
select UDFToLong(`id`) from test_split_tmp; {code}
*Why use the UDFToLong function? Because it returns NULL in this condition, 
but applying it to the string '1' should return the long value 1.*


{code:java}
++
| id     |
++
| NULL   |
| NULL   |
| NULL   |
++ {code}
Therefore, I suspect there is an issue with the field splitting in 
MultiDelimitSerDe.

When I debugged this issue, I found the problems below:
 * org.apache.hadoop.hive.serde2.lazy.LazyStruct#findIndexes

           *when fields.length == 1, the delimiter index can't be found*

 
{code:java}
private int[] findIndexes(byte[] array, byte[] target) {
  if (fields.length <= 1) {  // bug
return new int[0];
  }
  ...
  for (int i = 1; i < indexes.length; i++) {  // bug
array = Arrays.copyOfRange(array, indexInNewArray + target.length, 
array.length);
indexInNewArray = Bytes.indexOf(array, target);
if (indexInNewArray == -1) {
  break;
}
indexes[i] = indexInNewArray + indexes[i - 1] + target.length;
  }
  return indexes;
}{code}
 
 * org.apache.hadoop.hive.serde2.lazy.LazyStruct#parseMultiDelimit

           *when fields.length == 1, the column startPosition can't be found*

 
{code:java}
public void parseMultiDelimit(byte[] rawRow, byte[] fieldDelimit) {
  ...
  int[] delimitIndexes = findIndexes(rawRow, fieldDelimit);
  ...
if (fields.length > 1 && delimitIndexes[i - 1] != -1) { // bug
  int start = delimitIndexes[i - 1] + fieldDelimit.length;
  startPosition[i] = start - i * diff;
} else {
  startPosition[i] = length + 1;
}
  }
  Arrays.fill(fieldInited, false);
  parsed = true;
}{code}
 

 


Multi-delimiter process:

*Actual:*  1|@| -> 1^A  id column starts at 0, next column starts at 1

*Expected:*  1|@| -> 1^A  id column starts at 0, next column starts at 2

 

Fix:
 # When fields.length == 1, the multi-delimiter index should still be found
 # When fields.length == 1, the column start position should be calculated correctly
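A minimal Python sketch of the intended delimiter scan (illustrative only, not 
the Hive implementation) shows that the number of columns should not matter 
when locating delimiter occurrences:

```python
def find_delimiter_indexes(row: bytes, delim: bytes) -> list:
    # Collect the start offset of every delimiter occurrence; the reported
    # bug skipped this scan entirely when the table had a single column.
    indexes, start = [], 0
    while True:
        i = row.find(delim, start)
        if i == -1:
            return indexes
        indexes.append(i)
        start = i + len(delim)

row = b"1|@|"  # one column followed by the multi-character delimiter
assert find_delimiter_indexes(row, b"|@|") == [1]
# Column 0 starts at offset 0; anything after the delimiter would start at
# offset 1 + len("|@|") = 4, not at offset 1 as the buggy code computed.
```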

 





[jira] [Commented] (HIVE-28249) Parquet legacy timezone conversion converts march 1st to 29th feb and fails with not a leap year exception

2024-05-16 Thread Simhadri Govindappa (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846837#comment-17846837
 ] 

Simhadri Govindappa commented on HIVE-28249:


Thanks, [~dkuzmenko] and [~zabetak], for the review and all the help :) 
The change is merged to master.

 

It looks like the jodd authors have acknowledged it as a bug:   
[https://github.com/oblac/jodd-util/issues/21] .

 

 

> Parquet legacy timezone conversion converts march 1st to 29th feb and fails 
> with not a leap year exception
> --
>
> Key: HIVE-28249
> URL: https://issues.apache.org/jira/browse/HIVE-28249
> Project: Hive
>  Issue Type: Task
>Reporter: Simhadri Govindappa
>Assignee: Simhadri Govindappa
>Priority: Major
>  Labels: pull-request-available
>
> When handling legacy timestamp conversions in Parquet, 'February 29' of year 
> '200' is an edge case.
> This is because, according to this: [https://www.lanl.gov/Caesar/node202.html]
> The Julian day for 200 CE/02/29 in the Julian calendar is different from the 
> Julian day in Gregorian Calendar .
> ||Date (BC/AD)||Date (CE)||Julian Day||Julian Day||
> |-|  -|(Julian Calendar)|(Gregorian Calendar)|
> |200 AD/02/28|200 CE/02/28|1794166|1794167|
> |200 AD/02/29|200 CE/02/29|1794167|1794168|
> |200 AD/03/01|200 CE/03/01|1794168|1794168|
> |300 AD/02/28|300 CE/02/28|1830691|1830691|
> |300 AD/02/29|300 CE/02/29|1830692|1830692|
> |300 AD/03/01|300 CE/03/01|1830693|1830692|
>  
>  * Because of this:
> {noformat}
> int julianDay = nt.getJulianDay(); {noformat}
> returns julian day 1794167 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java#L92]
>  * Later :
> {noformat}
> Timestamp result = Timestamp.valueOf(formatter.format(date)); {noformat}
> {{formatter.format(date)}} returns 29-02-200, as it seems to use the Julian 
> calendar, but {{Timestamp.valueOf(29-02-200)}} seems to use the Gregorian 
> calendar and fails with a "not a leap year" exception for 29th Feb 200.
> [https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L196]
> Since hive stores timestamp in UTC, when converting 200 CE/03/01 between 
> timezones, hive runs into an exception and fails with "not a leap year 
> exception" for 29th Feb 200 even if the actual record inserted was 200 
> CE/03/01 in Asia/Singapore timezone.
>  
> Fullstack trace:
> {noformat}
> java.lang.RuntimeException: java.io.IOException: 
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/Users/simhadri.govindappa/Documents/apache/hive/itests/qtest/target/localfs/warehouse/test_sgt/sgt000
>     at 
> org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:210)
>     at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95)
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
>     at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185)
>     at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:230)
>     at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356)
>     at 
> org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:732)
>     at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:702)
>     at 
> org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:116)
>     at 
> org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
>     at 
> org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
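The calendar discrepancy in the quoted table can be reproduced outside Hive with the standard Fliegel–Van Flandern style integer conversion formulas. This is an illustrative sketch under that assumption; the class and method names are hypothetical and this is not Hive's actual code path:

```java
// Sketch: reproduce the Julian-vs-Gregorian Julian Day Numbers from the table
// above using Fliegel-Van Flandern style integer formulas (truncating
// division, as in the original FORTRAN). Names here are illustrative only.
public class JulianDayCheck {

    // JDN for a date in the proleptic Gregorian calendar.
    static int gregorianJdn(int y, int m, int d) {
        int a = (m - 14) / 12; // -1 for Jan/Feb, 0 for Mar-Dec
        return (1461 * (y + 4800 + a)) / 4
             + (367 * (m - 2 - 12 * a)) / 12
             - (3 * ((y + 4900 + a) / 100)) / 4
             + d - 32075;
    }

    // JDN for a date in the Julian calendar.
    static int julianJdn(int y, int m, int d) {
        return 367 * y
             - (7 * (y + 5001 + (m - 9) / 7)) / 4
             + (275 * m) / 9
             + d + 1729777;
    }

    public static void main(String[] args) {
        // 200 CE/02/29 exists only in the Julian calendar; the two calendars
        // disagree by one day around it, which is the edge case above.
        System.out.println(julianJdn(200, 2, 29));   // 1794167
        System.out.println(gregorianJdn(200, 3, 1)); // 1794168
        System.out.println(julianJdn(200, 3, 1));    // 1794168
        System.out.println(julianJdn(300, 3, 1));    // 1830693
        System.out.println(gregorianJdn(300, 3, 1)); // 1830692
    }
}
```

Converting a Julian-calendar JDN back with a Gregorian-calendar routine is exactly the off-by-one that surfaces as the spurious 29th Feb.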

[jira] [Resolved] (HIVE-28249) Parquet legacy timezone conversion converts march 1st to 29th feb and fails with not a leap year exception

2024-05-16 Thread Simhadri Govindappa (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simhadri Govindappa resolved HIVE-28249.

Fix Version/s: 4.1.0
   Resolution: Fixed

> Parquet legacy timezone conversion converts march 1st to 29th feb and fails 
> with not a leap year exception
> --
>
> Key: HIVE-28249
> URL: https://issues.apache.org/jira/browse/HIVE-28249
> Project: Hive
>  Issue Type: Task
>Reporter: Simhadri Govindappa
>Assignee: Simhadri Govindappa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> When handling legacy timestamp conversions in Parquet, 'February 29' of year 
> '200' is an edge case.
> This is because, according to [https://www.lanl.gov/Caesar/node202.html], the 
> Julian day for 200 CE/02/29 in the Julian calendar differs from the Julian 
> day in the Gregorian calendar.
> ||Date (BC/AD)||Date (CE)||Julian Day (Julian Calendar)||Julian Day (Gregorian Calendar)||
> |200 AD/02/28|200 CE/02/28|1794166|1794167|
> |200 AD/02/29|200 CE/02/29|1794167|1794168|
> |200 AD/03/01|200 CE/03/01|1794168|1794168|
> |300 AD/02/28|300 CE/02/28|1830691|1830691|
> |300 AD/02/29|300 CE/02/29|1830692|1830692|
> |300 AD/03/01|300 CE/03/01|1830693|1830692|
>  
>  * Because of this:
> {noformat}
> int julianDay = nt.getJulianDay(); {noformat}
> returns Julian day 1794167 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java#L92]
>  * Later:
> {noformat}
> Timestamp result = Timestamp.valueOf(formatter.format(date)); {noformat}
> {{formatter.format(date)}} returns 29-02-200, as it seems to use the Julian 
> calendar, but {{Timestamp.valueOf(29-02-200)}} seems to use the Gregorian 
> calendar and fails with a "not a leap year" exception for 29th Feb 200.
> [https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L196]
> Since Hive stores timestamps in UTC, when converting 200 CE/03/01 between 
> timezones, Hive runs into this exception and fails with a "not a leap year" 
> exception for 29th Feb 200, even if the record actually inserted was 200 
> CE/03/01 in the Asia/Singapore timezone.
>  
> Fullstack trace:
> {noformat}
> java.lang.RuntimeException: java.io.IOException: 
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/Users/simhadri.govindappa/Documents/apache/hive/itests/qtest/target/localfs/warehouse/test_sgt/sgt000
>     at 
> org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:210)
>     at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95)
>     at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
>     at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
>     at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185)
>     at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:230)
>     at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201)
>     at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425)
>     at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356)
>     at 
> org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:732)
>     at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:702)
>     at 
> org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:116)
>     at 
> org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
>     at 
> org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>     at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable

[jira] [Resolved] (HIVE-28251) HiveSessionImpl init ReaderStream should set Charset with UTF-8

2024-05-15 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HIVE-28251.
-
Resolution: Fixed

> HiveSessionImpl init ReaderStream should set Charset with UTF-8
> ---
>
> Key: HIVE-28251
> URL: https://issues.apache.org/jira/browse/HIVE-28251
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 3.1.3
>Reporter: xy
>Assignee: xy
>Priority: Major
> Fix For: 4.1.0
>
>
> Some StreamReaders are created without an explicit UTF-8 charset. If the 
> platform default charset (e.g. a Latin encoding) cannot represent Chinese 
> characters and the configuration contains Chinese characters, the content is 
> not decoded correctly. We should therefore set UTF-8 explicitly in the 
> StreamReader; other compute frameworks, such as Calcite and Hudi, 
> consistently create StreamReaders with the UTF-8 charset.
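A minimal sketch of the pattern described above, using only the JDK; the class name is illustrative and this is not the actual HiveSessionImpl change:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Sketch: wrap an InputStream with an explicit UTF-8 charset instead of
// relying on the JVM's platform default, which may be a Latin encoding
// that cannot represent Chinese characters.
public class Utf8ReaderExample {
    public static void main(String[] args) throws IOException {
        // A config value containing Chinese characters, stored as UTF-8 bytes.
        byte[] confBytes = "\u8def\u5f84".getBytes(StandardCharsets.UTF_8);

        // Fragile: new InputStreamReader(in) decodes with the default charset.
        // Robust: pass the charset explicitly so decoding is locale-independent.
        InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(confBytes), StandardCharsets.UTF_8);

        char[] buf = new char[2];
        int n = reader.read(buf);
        System.out.println(new String(buf, 0, n)); // prints the two characters intact
    }
}
```

The one-argument `InputStreamReader(InputStream)` constructor silently picks up `file.encoding`, which is why the bug only reproduces on hosts whose default charset lacks the needed characters.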



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28251) HiveSessionImpl init ReaderStream should set Charset with UTF-8

2024-05-15 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846748#comment-17846748
 ] 

Ayush Saxena commented on HIVE-28251:
-

Committed to master.
Thanx [~xuzifu] for the contribution!!!

> HiveSessionImpl init ReaderStream should set Charset with UTF-8
> ---
>
> Key: HIVE-28251
> URL: https://issues.apache.org/jira/browse/HIVE-28251
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 3.1.3
>Reporter: xy
>Assignee: xy
>Priority: Major
> Fix For: 4.1.0
>
>
> Some StreamReaders are created without an explicit UTF-8 charset. If the 
> platform default charset (e.g. a Latin encoding) cannot represent Chinese 
> characters and the configuration contains Chinese characters, the content is 
> not decoded correctly. We should therefore set UTF-8 explicitly in the 
> StreamReader; other compute frameworks, such as Calcite and Hudi, 
> consistently create StreamReaders with the UTF-8 charset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

