[jira] [Updated] (HIVE-27956) Query based compactor implementation separation

2023-12-14 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-27956:
-
Component/s: Hive

> Query based compactor implementation separation
> ---
>
> Key: HIVE-27956
> URL: https://issues.apache.org/jira/browse/HIVE-27956
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Minor
>
> Currently all query-based compactors rely on the CompactionQueryBuilder 
> class, where the query generation for all implementations is mixed 
> together. This can lead to issues when changing a query, as the change 
> may affect multiple compactors. Query generation should be moved inside 
> the individual query-based compactors, and this class should become a 
> utility/helper class providing only the common features.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27956) Query based compactor implementation separation

2023-12-14 Thread Marta Kuczora (Jira)
Marta Kuczora created HIVE-27956:


 Summary: Query based compactor implementation separation
 Key: HIVE-27956
 URL: https://issues.apache.org/jira/browse/HIVE-27956
 Project: Hive
  Issue Type: Improvement
Reporter: Marta Kuczora
Assignee: Marta Kuczora


Currently all query-based compactors rely on the CompactionQueryBuilder 
class, where the query generation for all implementations is mixed together. 
This can lead to issues when changing a query, as the change may affect 
multiple compactors. Query generation should be moved inside the individual 
query-based compactors, and this class should become a utility/helper class 
providing only the common features.
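
For illustration, a hedged sketch of the kind of statement one of these compactors assembles (all table and alias names here are hypothetical; the validate_acid_sort_order call is the one the query-based MAJOR compaction is known to use):
{noformat}
-- Illustrative only: roughly the shape of the rewrite query a query-based
-- MAJOR compactor issues. The temporary result table name is made up.
INSERT OVERWRITE TABLE default.tmp_compactor_result
SELECT validate_acid_sort_order(t.ROW__ID.writeId, t.ROW__ID.bucketId, t.ROW__ID.rowId),
       t.ROW__ID.writeId, t.ROW__ID.bucketId, t.ROW__ID.rowId,
       t.id, t.value
FROM default.transactions t;
{noformat}
Because every such statement is currently stitched together inside CompactionQueryBuilder, a change to one clause can silently alter what the other compactors generate.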



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (HIVE-22969) Union remove optimisation results in incorrect data when inserting to ACID table

2021-08-23 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-22969:


Assignee: (was: Marta Kuczora)

> Union remove optimisation results in incorrect data when inserting to ACID table
> -
>
> Key: HIVE-22969
> URL: https://issues.apache.org/jira/browse/HIVE-22969
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Priority: Major
>
> Steps to reproduce the issue:
> {noformat}
> create table input_text(key string, val string) stored as textfile location 
> '/Users/martakuczora/work/hive/warehouse/external/input_text';
> create table output_acid(key string, val string) stored as orc 
> tblproperties('transactional'='true');
> insert into input_text values ('1','1'), ('2','2'),('3','3');
> {noformat}
> {noformat}
> set hive.mapred.mode=nonstrict;
> set hive.stats.autogather=false;
> set hive.optimize.union.remove=true;
> set hive.auto.convert.join=true;
> set hive.exec.submitviachild=false;
> set hive.exec.submit.local.task.via.child=false;
> SELECT * FROM (
> select key, val from input_text
> union all
> select a.key as key, b.val as val FROM input_text a join input_text b on 
> a.key=b.key) c;
> The result of the select:
> 1 1
> 2 2
> 3 3
> 1 1
> 2 2
> 3 3
> {noformat}
> {noformat}
> insert into table output_acid
> SELECT * FROM (
> select key, val from input_text
> union all
> select a.key as key, b.val as val FROM input_text a join input_text b on 
> a.key=b.key) c;
> select * from output_acid;
> The result:
> 1 1
> 2 2
> 3 3
> {noformat}
> The folder of the output_acid table contained the following delta directories:
> {noformat}
> drwxr-xr-x  6 martakuczora  staff  192 Mar  2 16:29 delta_000_000
> drwxr-xr-x  6 martakuczora  staff  192 Mar  2 16:29 delta_001_001_0001
> {noformat}
> It can be seen that the statement ID is missing from the first directory's 
> name, so when a select statement runs on the table, this directory will be 
> ignored. That is why only half of the data was returned when running the 
> select on the output_acid table.
> If either hive.stats.autogather is set to true or hive.optimize.union.remove 
> is set to false, the result of the insert will be correct. In this case there 
> will be only one delta directory in the table's folder.
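> As a hedged sketch of the workaround just described (either setting on its 
> own is stated to avoid the problem):
> {noformat}
> -- either of these avoids the missing statement ID, per the description above
> set hive.stats.autogather=true;
> -- or
> set hive.optimize.union.remove=false;
> {noformat}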



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-25457) Implement querying Iceberg table metadata

2021-08-23 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403053#comment-17403053
 ] 

Marta Kuczora commented on HIVE-25457:
--

Pushed to master. Thanks a lot [~pvary] for the review!

> Implement querying Iceberg table metadata
> -
>
> Key: HIVE-25457
> URL: https://issues.apache.org/jira/browse/HIVE-25457
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In order to be able to query Iceberg table metadata (snapshots, manifests, 
> files), we should add syntax support and implement the feature so that the 
> correct results are returned.
> The Iceberg metadata tables can be addressed like this: 
> <database_name>.<table_name>.<metadata_table_name>
> For example: default.iceberg_table.history
> The following metadata tables are available in Iceberg:
>   ENTRIES,
>   FILES,
>   HISTORY,
>   SNAPSHOTS,
>   MANIFESTS,
>   PARTITIONS,
>   ALL_DATA_FILES,
>   ALL_MANIFESTS,
>   ALL_ENTRIES



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25457) Implement querying Iceberg table metadata

2021-08-23 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-25457.
--
Resolution: Fixed

> Implement querying Iceberg table metadata
> -
>
> Key: HIVE-25457
> URL: https://issues.apache.org/jira/browse/HIVE-25457
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In order to be able to query Iceberg table metadata (snapshots, manifests, 
> files), we should add syntax support and implement the feature so that the 
> correct results are returned.
> The Iceberg metadata tables can be addressed like this: 
> <database_name>.<table_name>.<metadata_table_name>
> For example: default.iceberg_table.history
> The following metadata tables are available in Iceberg:
>   ENTRIES,
>   FILES,
>   HISTORY,
>   SNAPSHOTS,
>   MANIFESTS,
>   PARTITIONS,
>   ALL_DATA_FILES,
>   ALL_MANIFESTS,
>   ALL_ENTRIES



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25457) Implement querying Iceberg table metadata

2021-08-17 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25457:
-
Description: 
In order to be able to query Iceberg table metadata (snapshots, manifests, 
files), we should add syntax support and implement the feature so that the 
correct results are returned.

The Iceberg metadata tables can be addressed like this: 
<database_name>.<table_name>.<metadata_table_name>
For example: default.iceberg_table.history

The following metadata tables are available in Iceberg:
  ENTRIES,
  FILES,
  HISTORY,
  SNAPSHOTS,
  MANIFESTS,
  PARTITIONS,
  ALL_DATA_FILES,
  ALL_MANIFESTS,
  ALL_ENTRIES
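
A minimal sketch of the intended query syntax, assuming the addressing pattern above (the table name is hypothetical):
{noformat}
-- query the snapshot history of an Iceberg table
SELECT * FROM default.iceberg_table.history;

-- inspect the table's snapshots and data files
SELECT * FROM default.iceberg_table.snapshots;
SELECT * FROM default.iceberg_table.files;
{noformat}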


> Implement querying Iceberg table metadata
> -
>
> Key: HIVE-25457
> URL: https://issues.apache.org/jira/browse/HIVE-25457
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> In order to be able to query Iceberg table metadata (snapshots, manifests, 
> files), we should add syntax support and implement the feature so that the 
> correct results are returned.
> The Iceberg metadata tables can be addressed like this: 
> <database_name>.<table_name>.<metadata_table_name>
> For example: default.iceberg_table.history
> The following metadata tables are available in Iceberg:
>   ENTRIES,
>   FILES,
>   HISTORY,
>   SNAPSHOTS,
>   MANIFESTS,
>   PARTITIONS,
>   ALL_DATA_FILES,
>   ALL_MANIFESTS,
>   ALL_ENTRIES



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25457) Implement querying Iceberg table metadata

2021-08-17 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-25457:



> Implement querying Iceberg table metadata
> -
>
> Key: HIVE-25457
> URL: https://issues.apache.org/jira/browse/HIVE-25457
> Project: Hive
>  Issue Type: New Feature
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25325) Add TRUNCATE TABLE support for Hive Iceberg tables

2021-07-22 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-25325.
--
Resolution: Fixed

> Add TRUNCATE TABLE support for Hive Iceberg tables
> --
>
> Key: HIVE-25325
> URL: https://issues.apache.org/jira/browse/HIVE-25325
> Project: Hive
>  Issue Type: Improvement
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Implement the TRUNCATE operation for Hive Iceberg tables. Since these tables 
> are unpartitioned in Hive, only truncating unpartitioned tables has to be 
> supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-25325) Add TRUNCATE TABLE support for Hive Iceberg tables

2021-07-22 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385595#comment-17385595
 ] 

Marta Kuczora commented on HIVE-25325:
--

Pushed to master. Thanks a lot [~pvary] for the review!

> Add TRUNCATE TABLE support for Hive Iceberg tables
> --
>
> Key: HIVE-25325
> URL: https://issues.apache.org/jira/browse/HIVE-25325
> Project: Hive
>  Issue Type: Improvement
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Implement the TRUNCATE operation for Hive Iceberg tables. Since these tables 
> are unpartitioned in Hive, only truncating unpartitioned tables has to be 
> supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25357) Fix the checkstyle issue in HiveIcebergMetaHook and the iceberg test issues to unblock the pre-commit tests

2021-07-20 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25357:
-
Summary: Fix the checkstyle issue in HiveIcebergMetaHook and the iceberg 
test issues to unblock the pre-commit tests  (was: Fix the checkstyle issue in 
HiveIcebergMetaHook an ]which breaks the build)

> Fix the checkstyle issue in HiveIcebergMetaHook and the iceberg test issues 
> to unblock the pre-commit tests
> ---
>
> Key: HIVE-25357
> URL: https://issues.apache.org/jira/browse/HIVE-25357
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [ERROR] 
> /home/jenkins/agent/workspace/hive-precommit_master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java:221:3:
>  Cyclomatic Complexity is 13 (max allowed is 12). [CyclomaticComplexity]
> This issue probably came in with 
> [this|https://github.com/apache/hive/commit/76c49b9df957c8c05b81a4016282c03648b728b9]
>  commit 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25357) Fix the checkstyle issue in HiveIcebergMetaHook an ]which breaks the build

2021-07-20 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25357:
-
Summary: Fix the checkstyle issue in HiveIcebergMetaHook an ]which breaks 
the build  (was: Fix the checkstyle issue in HiveIcebergMetaHook which breaks 
the build)

> Fix the checkstyle issue in HiveIcebergMetaHook an ]which breaks the build
> --
>
> Key: HIVE-25357
> URL: https://issues.apache.org/jira/browse/HIVE-25357
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [ERROR] 
> /home/jenkins/agent/workspace/hive-precommit_master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java:221:3:
>  Cyclomatic Complexity is 13 (max allowed is 12). [CyclomaticComplexity]
> This issue probably came in with 
> [this|https://github.com/apache/hive/commit/76c49b9df957c8c05b81a4016282c03648b728b9]
>  commit 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25357) Fix the checkstyle issue in HiveIcebergMetaHook which breaks the build

2021-07-20 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-25357:



> Fix the checkstyle issue in HiveIcebergMetaHook which breaks the build
> --
>
> Key: HIVE-25357
> URL: https://issues.apache.org/jira/browse/HIVE-25357
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> [ERROR] 
> /home/jenkins/agent/workspace/hive-precommit_master/iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java:221:3:
>  Cyclomatic Complexity is 13 (max allowed is 12). [CyclomaticComplexity]
> This issue probably came in with 
> [this|https://github.com/apache/hive/commit/76c49b9df957c8c05b81a4016282c03648b728b9]
>  commit 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25325) Add TRUNCATE TABLE support for Hive Iceberg tables

2021-07-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25325:
-
Description: Implement the TRUNCATE operation for Hive Iceberg tables. 
Since these tables are unpartitioned in Hive, only truncating unpartitioned 
tables has to be supported.
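
A minimal usage sketch, assuming the storage-handler syntax for creating a Hive Iceberg table (the table name is hypothetical):
{noformat}
CREATE EXTERNAL TABLE ice_t (id int, value string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
INSERT INTO ice_t VALUES (1, 'a'), (2, 'b');

TRUNCATE TABLE ice_t;
-- expected to return 0 once the operation is supported
SELECT count(*) FROM ice_t;
{noformat}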

> Add TRUNCATE TABLE support for Hive Iceberg tables
> --
>
> Key: HIVE-25325
> URL: https://issues.apache.org/jira/browse/HIVE-25325
> Project: Hive
>  Issue Type: Improvement
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> Implement the TRUNCATE operation for Hive Iceberg tables. Since these tables 
> are unpartitioned in Hive, only truncating unpartitioned tables has to be 
> supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25325) Add TRUNCATE TABLE support for Hive Iceberg tables

2021-07-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-25325:



> Add TRUNCATE TABLE support for Hive Iceberg tables
> --
>
> Key: HIVE-25325
> URL: https://issues.apache.org/jira/browse/HIVE-25325
> Project: Hive
>  Issue Type: Improvement
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25264) Add tests to verify Hive can read/write after schema change on Iceberg table

2021-07-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-25264.
--
Resolution: Fixed

Pushed to master. Thanks a lot [~szita] and [~mbod] for the review!

> Add tests to verify Hive can read/write after schema change on Iceberg table
> 
>
> Key: HIVE-25264
> URL: https://issues.apache.org/jira/browse/HIVE-25264
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> We should verify whether Hive can properly read/write Iceberg tables after 
> their schema has been modified through the Iceberg API (as when another 
> engine, such as Spark, has modified the schema).
> Unit tests should be added for the following operations offered by the 
> UpdateSchema interface in the Iceberg API:
> - adding a new top level column
> - adding a new nested column
> - adding a required column
> - adding a required nested column
> - renaming a column
> - updating a column
> - making a column required
> - deleting a column
> - changing the order of the columns in the schema



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25264) Add tests to verify Hive can read/write after schema change on Iceberg table

2021-07-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25264:
-
Fix Version/s: 4.0.0

> Add tests to verify Hive can read/write after schema change on Iceberg table
> 
>
> Key: HIVE-25264
> URL: https://issues.apache.org/jira/browse/HIVE-25264
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> We should verify whether Hive can properly read/write Iceberg tables after 
> their schema has been modified through the Iceberg API (as when another 
> engine, such as Spark, has modified the schema).
> Unit tests should be added for the following operations offered by the 
> UpdateSchema interface in the Iceberg API:
> - adding a new top level column
> - adding a new nested column
> - adding a required column
> - adding a required nested column
> - renaming a column
> - updating a column
> - making a column required
> - deleting a column
> - changing the order of the columns in the schema



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25310) Fix local test run problems with Iceberg tests: Socket closed by peer

2021-07-08 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-25310.
--
Resolution: Fixed

Pushed to master. Thanks a lot [~szita] for the review!

> Fix local test run problems with Iceberg tests: Socket closed by peer
> -
>
> Key: HIVE-25310
> URL: https://issues.apache.org/jira/browse/HIVE-25310
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When running the tests in the iceberg-catalog and iceberg-handler modules 
> locally using mvn, we often get errors like this:
> [ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 
> s  <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
> elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
> Socket is closed by peer. at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
>  at 
> org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)
> The same problem does not occur when running it from IntelliJ or on CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-25310) Fix local test run problems with Iceberg tests: Socket closed by peer

2021-07-07 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376500#comment-17376500
 ] 

Marta Kuczora commented on HIVE-25310:
--

The problem is caused by "java.lang.NoClassDefFoundError: 
org/apache/hadoop/hdfs/protocol/SnapshotException".
It seems that the hadoop-hdfs jar, which contains this class, is missing from 
the classpath of the iceberg-catalog and iceberg-handler tests. Adding this 
dependency with test scope to these modules solves the problem.

> Fix local test run problems with Iceberg tests: Socket closed by peer
> -
>
> Key: HIVE-25310
> URL: https://issues.apache.org/jira/browse/HIVE-25310
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When running the tests in the iceberg-catalog and iceberg-handler modules 
> locally using mvn, we often get errors like this:
> [ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 
> s  <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
> elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
> Socket is closed by peer. at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
>  at 
> org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)
> The same problem does not occur when running it from IntelliJ or on CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25310) Fix local test run problems with Iceberg tests: Socket closed by peer

2021-07-07 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25310:
-
Description: 
When running the tests in the iceberg-catalog and iceberg-handler modules 
locally using mvn, we often get errors like this:
{noformat}
[ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 s 
 <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
Socket is closed by peer. at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
 at 
org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)
{noformat}

The same problem does not occur when running it from IntelliJ or on CI.

  was:
When running the tests in the iceberg-catalog and iceberg-handler modules 
locally using mvn, we often get errors like this:

[ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 s 
 <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
Socket is closed by peer. at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
 at 
org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)


The same problem does not occur when running it from IntelliJ or on CI.


> Fix local test run problems with Iceberg tests: Socket closed by peer
> -
>
> Key: HIVE-25310
> URL: https://issues.apache.org/jira/browse/HIVE-25310
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> When running the tests in the iceberg-catalog and iceberg-handler modules 
> locally using mvn, we often get errors like this:
> {noformat}
> [ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 
> s  <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
> elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
> Socket is closed by peer. at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
>  at 
> org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)
> {noformat}
> The same problem does not occur when running it from IntelliJ or on CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25310) Fix local test run problems with Iceberg tests: Socket closed by peer

2021-07-07 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25310:
-
Description: 
When running the tests in the iceberg-catalog and iceberg-handler modules 
locally using mvn, we often get errors like this:

[ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 s 
 <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
Socket is closed by peer. at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
 at 
org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)


The same problem does not occur when running it from IntelliJ or on CI.

  was:
When running the tests in the iceberg-catalog and iceberg-handler modules 
locally using mvn, we often get errors like this:
{noformat}
[ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 s 
 <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
Socket is closed by peer. at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
 at 
org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)
{noformat}

The same problem does not occur when running it from IntelliJ or on CI.


> Fix local test run problems with Iceberg tests: Socket closed by peer
> -
>
> Key: HIVE-25310
> URL: https://issues.apache.org/jira/browse/HIVE-25310
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> When running the tests in the iceberg-catalog and iceberg-handler modules 
> locally using mvn, we often get errors like this:
> [ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 
> s  <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
> elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
> Socket is closed by peer. at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
>  at 
> org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)
> The same problem does not occur when running it from IntelliJ or on CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25310) Fix local test run problems with Iceberg tests: Socket closed by peer

2021-07-07 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25310:
-
Description: 
When running the tests in the iceberg-catalog and iceberg-handler modules 
locally using mvn, we often get errors like this:

[ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 s 
 <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
Socket is closed by peer. at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
 at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
 at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
 at 
org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)


The same problem does not occur when running it from IntelliJ or on CI.

> Fix local test run problems with Iceberg tests: Socket closed by peer
> -
>
> Key: HIVE-25310
> URL: https://issues.apache.org/jira/browse/HIVE-25310
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> When running the tests in the iceberg-catalog and iceberg-handler modules 
> locally using mvn, we often get errors like this:
> [ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time elapsed: 5.022 
> s  <<< ERROR![ERROR] org.apache.iceberg.hive.TestHiveTableConcurrency  Time 
> elapsed: 5.022 s  <<< ERROR!org.apache.thrift.transport.TTransportException: 
> Socket is closed by peer. at 
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:181)
>  at org.apache.thrift.transport.TTransport.readAll(TTransport.java:109) at 
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:463) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:361) 
> at 
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:244)
>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_database(ThriftHiveMetastore.java:1295)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.create_database(ThriftHiveMetastore.java:1282)
>  at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createDatabase(HiveMetaStoreClient.java:1148)
>  at 
> org.apache.iceberg.hive.HiveMetastoreTest.startMetastore(HiveMetastoreTest.java:51)
> The same problem does not occur when running it from IntelliJ or on CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25310) Fix local test run problems with Iceberg tests: Socket closed by peer

2021-07-07 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-25310:



> Fix local test run problems with Iceberg tests: Socket closed by peer
> -
>
> Key: HIVE-25310
> URL: https://issues.apache.org/jira/browse/HIVE-25310
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-22 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-25258.
--
Resolution: Fixed

Pushed to master. Thanks a lot [~lpinter] for the review!

> Incorrect row order after query-based MINOR compaction
> --
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The query-based MINOR compaction uses the following sorting order in its 
> inner query: `bucket`, `originalTransaction`, `rowId`, as can be seen in 
> the 
> [code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].
> But the rows should actually be ordered by originalTransactionId, 
> bucketProperty and rowId; otherwise the delete deltas cannot be applied 
> correctly. This is also the order that the MR MAJOR and MR MINOR compactions 
> write. 
> The sorting order used by the query-based MINOR compaction can lead to 
> duplicated rows when running the compaction after multiple merge statements. 
> This issue can be reproduced for example by running the following queries:
> {noformat}
> CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 
> 'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 
> 'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 
> 'value_11'),(12, 'value_12');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
>   (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 
> 'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_3 VALUES
>   (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 
> 'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 
> 'value_15');
> MERGE INTO transactions AS T 
> USING merge_source_3 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> {noformat}
> Running a select after the second compaction finished will return duplicated 
> rows:
> {noformat}
> select * from transactions order by id;
> +------------------+---------------------+
> | transactions.id  | transactions.value  |
> +------------------+---------------------+
> | 1                | newvalue_1          |
> | 1                | latestvalue_1       |
> | 2                | newestvalue_2       |
> | 2                | newvalue_2          |
> | 3                | value_03            |
> | 4                | latestvalue_4       |
> | 4                | newvalue_4          |
> | 5                | latestvalue_5       |
> | 6                | newvalue_6          |
> | 6                | newestvalue_6       |
> | 7                | value_07            |
> | 8                | value_08            |
> | 9                | latestvalue_9       |
> | 10               | newestvalue_10      |
> | 11               | latestvalue_11      |
> | 12               | value_12            |
> | 13               | latestvalue_13      |
> | 14               | value_14            |
> | 15               | value_15            |
> +------------------+---------------------+
> {noformat}
> If the same queries are run with MR MINOR compaction, instead of the 
> query-based MINOR compaction, the select will return the correct result:
> {noformat}
> +------------------+---------------------+
> | transactions.id  | transactions.value  |
> +------------------+---------------------+
> | 1                | latestvalue_1 

[jira] [Commented] (HIVE-25257) Incorrect row order validation for query-based MAJOR compaction

2021-06-18 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365753#comment-17365753
 ] 

Marta Kuczora commented on HIVE-25257:
--

Pushed to master! Thanks a lot [~lpinter] for the review!

> Incorrect row order validation for query-based MAJOR compaction
> ---
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the insert query of the query-based MAJOR compaction, there is this 
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
> ROW__ID.rowId)".
> This is to validate whether the order of the rows is correct. This validation 
> is done by the GenericUDFValidateAcidSortOrder class, and it assumes that the 
> rows are in increasing order by bucketProperty, originalTransactionId and 
> rowId. 
> But the rows should actually be ordered by originalTransactionId, 
> bucketProperty and rowId; otherwise the delete deltas cannot be applied 
> correctly. This is also the order that the MR MAJOR compaction writes and in 
> which the split groups are created for the query-based MAJOR compaction. It 
> doesn't cause any issue as long as there is only one bucketProperty in the 
> files, but as soon as there are multiple bucketProperties in the same file, 
> the validation will fail. This can be reproduced by running multiple merge 
> statements one after another.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES 
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(
>  ID int,
>  value string)
> STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction will fail with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
> of Acid rows detected for the rows: 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
>  and 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order 
> is originalTransactionId, bucketProperty, rowId.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-25257) Incorrect row order validation for query-based MAJOR compaction

2021-06-18 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-25257.
--
Resolution: Fixed

> Incorrect row order validation for query-based MAJOR compaction
> ---
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> In the insert query of the query-based MAJOR compaction, there is this 
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
> ROW__ID.rowId)".
> This is to validate whether the order of the rows is correct. This validation 
> is done by the GenericUDFValidateAcidSortOrder class, and it assumes that the 
> rows are in increasing order by bucketProperty, originalTransactionId and 
> rowId. 
> But the rows should actually be ordered by originalTransactionId, 
> bucketProperty and rowId; otherwise the delete deltas cannot be applied 
> correctly. This is also the order that the MR MAJOR compaction writes and in 
> which the split groups are created for the query-based MAJOR compaction. It 
> doesn't cause any issue as long as there is only one bucketProperty in the 
> files, but as soon as there are multiple bucketProperties in the same file, 
> the validation will fail. This can be reproduced by running multiple merge 
> statements one after another.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES 
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(
>  ID int,
>  value string)
> STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction will fail with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
> of Acid rows detected for the rows: 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
>  and 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order 
> is originalTransactionId, bucketProperty, rowId.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25264) Add tests to verify Hive can read/write after schema change on Iceberg table

2021-06-18 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25264:
-
Description: 
We should verify whether Hive can properly read/write Iceberg tables after 
their schema has been modified through the Iceberg API (as when another 
engine, such as Spark, has modified the schema).
Unit tests should be added for the following operations offered by the 
UpdateSchema interface in the Iceberg API:
- adding a new top level column
- adding a new nested column
- adding a required column
- adding a required nested column
- renaming a column
- updating a column
- making a column required
- deleting a column
- changing the order of the columns in the schema


> Add tests to verify Hive can read/write after schema change on Iceberg table
> 
>
> Key: HIVE-25264
> URL: https://issues.apache.org/jira/browse/HIVE-25264
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> We should verify whether Hive can properly read/write Iceberg tables after 
> their schema has been modified through the Iceberg API (as when another 
> engine, such as Spark, has modified the schema).
> Unit tests should be added for the following operations offered by the 
> UpdateSchema interface in the Iceberg API:
> - adding a new top level column
> - adding a new nested column
> - adding a required column
> - adding a required nested column
> - renaming a column
> - updating a column
> - making a column required
> - deleting a column
> - changing the order of the columns in the schema



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25264) Add tests to verify Hive can read/write after schema change on Iceberg table

2021-06-18 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-25264:



> Add tests to verify Hive can read/write after schema change on Iceberg table
> 
>
> Key: HIVE-25264
> URL: https://issues.apache.org/jira/browse/HIVE-25264
> Project: Hive
>  Issue Type: Test
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Description: 
The query-based MINOR compaction uses the following sorting order in its inner 
query: `bucket`, `originalTransaction`, `rowId`, as can be seen in the 
[code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].

But the rows should actually be ordered by originalTransactionId, 
bucketProperty and rowId; otherwise the delete deltas cannot be applied 
correctly. This is also the order that the MR MAJOR and MR MINOR compactions 
write. 
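
A compact sketch of the discrepancy, keeping only the ordering clause (the 
rest of the inner query is elided; column names follow the linked code):
{noformat}
-- current ordering in the query-based MINOR compaction's inner query:
ORDER BY `bucket`, `originalTransaction`, `rowId`
-- ordering that matches delete-delta application and the MR compactions:
ORDER BY `originalTransaction`, `bucket`, `rowId`
{noformat}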
The sorting order used by the query-based MINOR compaction can lead to 
duplicated rows when running the compaction after multiple merge statements. 
This issue can be reproduced for example by running the following queries:
{noformat}
CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES 
('transactional'='true');
INSERT INTO transactions VALUES
(1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 
'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');


CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 
'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 
'value_11'),(12, 'value_12');

MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value 
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_2 VALUES
  (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 
'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');

MERGE INTO transactions AS T 
USING merge_source_2 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


ALTER TABLE transactions COMPACT 'MINOR';


CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_3 VALUES
  (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 
'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 'value_15');

MERGE INTO transactions AS T 
USING merge_source_3 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

ALTER TABLE transactions COMPACT 'MINOR';
{noformat}

Running a select after the second compaction finished will return duplicated 
rows:
{noformat}
select * from transactions order by id;

+------------------+---------------------+
| transactions.id  | transactions.value  |
+------------------+---------------------+
| 1                | newvalue_1          |
| 1                | latestvalue_1       |
| 2                | newestvalue_2       |
| 2                | newvalue_2          |
| 3                | value_03            |
| 4                | latestvalue_4       |
| 4                | newvalue_4          |
| 5                | latestvalue_5       |
| 6                | newvalue_6          |
| 6                | newestvalue_6       |
| 7                | value_07            |
| 8                | value_08            |
| 9                | latestvalue_9       |
| 10               | newestvalue_10      |
| 11               | latestvalue_11      |
| 12               | value_12            |
| 13               | latestvalue_13      |
| 14               | value_14            |
| 15               | value_15            |
+------------------+---------------------+
{noformat}

If the same queries are run with MR MINOR compaction, instead of the 
query-based MINOR compaction, the select will return the correct result:
{noformat}
+------------------+---------------------+
| transactions.id  | transactions.value  |
+------------------+---------------------+
| 1                | latestvalue_1       |
| 2                | newestvalue_2       |
| 3                | value_03            |
| 4                | latestvalue_4       |
| 5                | latestvalue_5       |
| 6                | newestvalue_6       |
| 7                | value_07            |
| 8                | value_08            |
| 9                | latestvalue_9       |
| 10               | newestvalue_10      |
| 11               | latestvalue_11      |
| 12               | value_12            |
| 13               | latestvalue_13      |
| 14               | value_14            |
| 15               | value_15            |
+------------------+---------------------+
{noformat}

The content of the bucket files in the delta and delete delta directories after 
the query-based and MR 

[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Description: 
The query-based MINOR compaction uses the following sorting order in its inner 
query: `bucket`, `originalTransaction`, `rowId`, as can be seen in the 
[code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].

But the rows should actually be ordered by originalTransactionId, 
bucketProperty and rowId; otherwise the delete deltas cannot be applied 
correctly. This is also the order that the MR MAJOR and MR MINOR compactions 
write. 
The sorting order used by the query-based MINOR compaction can lead to 
duplicated rows when running the compaction after multiple merge statements. 
This issue can be reproduced for example by running the following queries:
{noformat}
CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES 
('transactional'='true');
INSERT INTO transactions VALUES
(1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 
'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');


CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 
'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 
'value_11'),(12, 'value_12');

MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value 
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_2 VALUES
  (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 
'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');

MERGE INTO transactions AS T 
USING merge_source_2 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


ALTER TABLE transactions COMPACT 'MINOR';


CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_3 VALUES
  (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 
'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 'value_15');

MERGE INTO transactions AS T 
USING merge_source_3 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

ALTER TABLE transactions COMPACT 'MINOR';
{noformat}

Running a select after the second compaction finished will return duplicated 
rows:
{noformat}
select * from transactions order by id;

+--+-+
| transactions.id  | transactions.value  |
+--+-+
| 1| newvalue_1  |
| 1| latestvalue_1   |
| 2| newestvalue_2   |
| 2| newvalue_2  |
| 3| value_03|
| 4| latestvalue_4   |
| 4| newvalue_4  |
| 5| latestvalue_5   |
| 6| newvalue_6  |
| 6| newestvalue_6   |
| 7| value_07|
| 8| value_08|
| 9| latestvalue_9   |
| 10   | newestvalue_10  |
| 11   | latestvalue_11  |
| 12   | value_12|
| 13   | latestvalue_13  |
| 14   | value_14|
| 15   | value_15|
+--+-+
{noformat}

If the same queries are run with MR MINOR compaction, instead of the 
query-based MINOR compaction, the select will return the correct result:
{noformat}
+--+-+
| transactions.id  | transactions.value  |
+--+-+
| 1| latestvalue_1   |
| 2| newestvalue_2   |
| 3| value_03|
| 4| latestvalue_4   |
| 5| latestvalue_5   |
| 6| newestvalue_6   |
| 7| value_07|
| 8| value_08|
| 9| latestvalue_9   |
| 10   | newestvalue_10  |
| 11   | latestvalue_11  |
| 12   | value_12|
| 13   | latestvalue_13  |
| 14   | value_14|
| 15   | value_15|
+--+-+
{noformat}

The content of the bucket files in the delta and delete delta directories after 
the query-based and MR 

[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Description: 
The query based MINOR compaction uses the following sorting order in its inner 
query: `bucket`, `originalTransaction`, `rowId`, as it can be seen in the 
[code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].

But actually the rows should be ordered by originalTransactionId, 
bucketProperty and rowId, otherwise the delete deltas cannot be applied 
correctly. And this is the order what the MR MAJOR and MR MINOR compactions 
write. 
The sorting order used by the query-based MINOR compaction can lead to 
duplicated rows when running the compaction after multiple merge statements. 
This issue can be reproduced for example by running the following queries:
{noformat}
CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES 
('transactional'='true');
INSERT INTO transactions VALUES
(1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 
'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');


CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 
'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 
'value_11'),(12, 'value_12');

MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value 
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_2 VALUES
  (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 
'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');

MERGE INTO transactions AS T 
USING merge_source_2 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


ALTER TABLE transactions COMPACT 'MINOR';


CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_3 VALUES
  (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 
'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 'value_15');

MERGE INTO transactions AS T 
USING merge_source_3 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

ALTER TABLE transactions COMPACT 'MINOR';
{noformat}

Running a select after the second compaction finished will return duplicated 
rows:
{noformat}
select * from transactions order by id;

+--+-+
| transactions.id  | transactions.value  |
+--+-+
| 1| newvalue_1  |
| 1| latestvalue_1   |
| 2| newestvalue_2   |
| 2| newvalue_2  |
| 3| value_03|
| 4| latestvalue_4   |
| 4| newvalue_4  |
| 5| latestvalue_5   |
| 6| newvalue_6  |
| 6| newestvalue_6   |
| 7| value_07|
| 8| value_08|
| 9| latestvalue_9   |
| 10   | newestvalue_10  |
| 11   | latestvalue_11  |
| 12   | value_12|
| 13   | latestvalue_13  |
| 14   | value_14|
| 15   | value_15|
+--+-+
{noformat}

If the same queries are run with MR MINOR compaction, instead of the 
query-based MINOR compaction, the select will return the correct result:
{noformat}
+--+-+
| transactions.id  | transactions.value  |
+--+-+
| 1| latestvalue_1   |
| 2| newestvalue_2   |
| 3| value_03|
| 4| latestvalue_4   |
| 5| latestvalue_5   |
| 6| newestvalue_6   |
| 7| value_07|
| 8| value_08|
| 9| latestvalue_9   |
| 10   | newestvalue_10  |
| 11   | latestvalue_11  |
| 12   | value_12|
| 13   | latestvalue_13  |
| 14   | value_14|
| 15   | value_15|
+--+-+
{noformat}

The content of the bucket files in the delta and delete delta directories after 
the query-based and MR 

[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Description: 
The query-based MINOR compaction uses the following sorting order in its inner 
query: `bucket`, `originalTransaction`, `rowId`, as can be seen in the 
[code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].

But the rows should actually be ordered by originalTransactionId, 
bucketProperty and rowId, otherwise the delete deltas cannot be applied 
correctly. This is also the order that the MR MAJOR and MR MINOR compactions 
write.
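
To make the two orderings concrete, here is a minimal, self-contained Java 
sketch (the AcidRowId class and its fields are illustrative stand-ins, not the 
actual Hive types):
{noformat}
import java.util.Comparator;

// Illustrative stand-in for an ACID row identifier; not the actual Hive type.
final class AcidRowId {
    final long originalTransactionId;
    final int bucketProperty;
    final long rowId;

    AcidRowId(long originalTransactionId, int bucketProperty, long rowId) {
        this.originalTransactionId = originalTransactionId;
        this.bucketProperty = bucketProperty;
        this.rowId = rowId;
    }

    // Order effectively produced by the query-based MINOR compaction:
    // bucket first.
    static final Comparator<AcidRowId> QUERY_BASED_MINOR_ORDER =
        Comparator.comparingInt((AcidRowId r) -> r.bucketProperty)
                  .thenComparingLong(r -> r.originalTransactionId)
                  .thenComparingLong(r -> r.rowId);

    // Order required so that the delete deltas can be applied: transaction id
    // first. This is also what the MR MAJOR and MR MINOR compactions write.
    static final Comparator<AcidRowId> REQUIRED_ACID_ORDER =
        Comparator.comparingLong((AcidRowId r) -> r.originalTransactionId)
                  .thenComparingInt(r -> r.bucketProperty)
                  .thenComparingLong(r -> r.rowId);
}
{noformat}
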
The sorting order used by the query-based MINOR compaction can lead to 
duplicated rows when running the compaction after multiple merge statements. 
This issue can be reproduced for example by running the following queries:
{noformat}
CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES 
('transactional'='true');
INSERT INTO transactions VALUES
(1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 
'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');


CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 
'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 
'value_11'),(12, 'value_12');

MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value 
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_2 VALUES
  (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 
'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');

MERGE INTO transactions AS T 
USING merge_source_2 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);


ALTER TABLE transactions COMPACT 'MINOR';


CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
INSERT INTO merge_source_3 VALUES
  (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 
'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 'value_15');

MERGE INTO transactions AS T 
USING merge_source_3 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

ALTER TABLE transactions COMPACT 'MINOR';
{noformat}

Running a select after the second compaction has finished will return 
duplicated rows:
{noformat}
select * from transactions order by id;

+--+-+
| transactions.id  | transactions.value  |
+--+-+
| 1| newvalue_1  |
| 1| latestvalue_1   |
| 2| newestvalue_2   |
| 2| newvalue_2  |
| 3| value_03|
| 4| latestvalue_4   |
| 4| newvalue_4  |
| 5| latestvalue_5   |
| 6| newvalue_6  |
| 6| newestvalue_6   |
| 7| value_07|
| 8| value_08|
| 9| latestvalue_9   |
| 10   | newestvalue_10  |
| 11   | latestvalue_11  |
| 12   | value_12|
| 13   | latestvalue_13  |
| 14   | value_14|
| 15   | value_15|
+--+-+
{noformat}

If the same queries are run with the MR MINOR compaction instead of the 
query-based MINOR compaction, the select will return the correct result:
{noformat}
+--+-+
| transactions.id  | transactions.value  |
+--+-+
| 1| newvalue_1  |
| 2| newestvalue_2   |
| 3| value_03|
| 4| newestvalue_4   |
| 5| value_05|
| 6| newestvalue_6   |
| 7| value_07|
| 8| value_08|
| 9| value_9 |
| 10   | newestvalue_10  |
| 11   | newestvalue_11  |
| 12   | value_12|
| 13   | value_13|
| 14   | value_14|
+--+-+
{noformat}


  was:Details will be added soon.


> Incorrect row order after query-based MINOR compaction
> --
>

[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Fix Version/s: 4.0.0

> Incorrect row order after query-based MINOR compaction
> --
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Component/s: Transactions

> Incorrect row order after query-based MINOR compaction
> --
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Description: Detail will be added soon.

> Incorrect row order after query-based MINOR compaction
> --
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> Detail will be added soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25258:
-
Description: Details will be added soon.  (was: Detail will be added soon.)

> Incorrect row order after query-based MINOR compaction
> --
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> Details will be added soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25257) Incorrect row order validation for query-based MAJOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25257:
-
Component/s: Transactions

> Incorrect row order validation for query-based MAJOR compaction
> ---
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> In the insert query of the query-based MAJOR compaction, there is this 
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
> ROW__ID.rowId)".
> This is to validate if the order of the rows is correct. This validation is 
> done by the GenericUDFValidateAcidSortOrder class and it assumes that the 
> rows are in increasing order by bucketProperty, originalTransactionId and 
> rowId. 
> But actually the rows should be ordered by originalTransactionId, 
> bucketProperty and rowId, otherwise the delete deltas cannot be applied 
> correctly. This is also the order that the MR MAJOR compaction writes and in 
> which the split groups are created for the query-based MAJOR compaction. It 
> doesn't cause any issue as long as there is only one bucketProperty in the 
> files, but as soon as there are multiple bucketProperties in the same file, 
> the validation will fail. This can be reproduced by running multiple merge 
> statements after each other.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES 
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(
>  ID int,
>  value string)
> STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction will fail with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
> of Acid rows detected for the rows: 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
>  and 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order 
> is originalTransactionId, bucketProperty, rowId.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25257) Incorrect row order validation for query-based MAJOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25257:
-
Fix Version/s: 4.0.0

> Incorrect row order validation for query-based MAJOR compaction
> ---
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> In the insert query of the query-based MAJOR compaction, there is this 
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
> ROW__ID.rowId)".
> This is to validate if the order of the rows is correct. This validation is 
> done by the GenericUDFValidateAcidSortOrder class and it assumes that the 
> rows are in increasing order by bucketProperty, originalTransactionId and 
> rowId. 
> But actually the rows should be ordered by originalTransactionId, 
> bucketProperty and rowId, otherwise the delete deltas cannot be applied 
> correctly. This is also the order that the MR MAJOR compaction writes and in 
> which the split groups are created for the query-based MAJOR compaction. It 
> doesn't cause any issue as long as there is only one bucketProperty in the 
> files, but as soon as there are multiple bucketProperties in the same file, 
> the validation will fail. This can be reproduced by running multiple merge 
> statements after each other.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES 
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(
>  ID int,
>  value string)
> STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction will fail with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
> of Acid rows detected for the rows: 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
>  and 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order 
> is originalTransactionId, bucketProperty, rowId.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-25258) Incorrect row order after query-based MINOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-25258:



> Incorrect row order after query-based MINOR compaction
> --
>
> Key: HIVE-25258
> URL: https://issues.apache.org/jira/browse/HIVE-25258
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HIVE-25257) Incorrect row order validation for query-based MAJOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-25257 started by Marta Kuczora.

> Incorrect row order validation for query-based MAJOR compaction
> ---
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> In the insert query of the query-based MAJOR compaction, there is this 
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
> ROW__ID.rowId)".
> This is to validate if the order of the rows is correct. This validation is 
> done by the GenericUDFValidateAcidSortOrder class and it assumes that the 
> rows are in increasing order by bucketProperty, originalTransactionId and 
> rowId. 
> But actually the rows should be ordered by originalTransactionId, 
> bucketProperty and rowId, otherwise the delete deltas cannot be applied 
> correctly. This is also the order that the MR MAJOR compaction writes and in 
> which the split groups are created for the query-based MAJOR compaction. It 
> doesn't cause any issue as long as there is only one bucketProperty in the 
> files, but as soon as there are multiple bucketProperties in the same file, 
> the validation will fail. This can be reproduced by running multiple merge 
> statements after each other.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES 
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(
>  ID int,
>  value string)
> STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction will fail with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
> of Acid rows detected for the rows: 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
>  and 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order 
> is originalTransactionId, bucketProperty, rowId.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-25257) Incorrect row order validation for query-based MAJOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25257:
-
Description: 
In the insert query of the query-based MAJOR compaction, there is this function 
call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
ROW__ID.rowId)".
This is to validate if the order of the rows is correct. This validation is 
done by the GenericUDFValidateAcidSortOrder class and it assumes that the rows 
are in increasing order by bucketProperty, originalTransactionId and rowId. 

But actually the rows should be ordered by originalTransactionId, 
bucketProperty and rowId, otherwise the delete deltas cannot be applied 
correctly. This is also the order that the MR MAJOR compaction writes and in 
which the split groups are created for the query-based MAJOR compaction. It 
doesn't cause any issue as long as there is only one bucketProperty in the 
files, but as soon as there are multiple bucketProperties in the same file, the 
validation will fail. This can be reproduced by running multiple merge 
statements after each other.
For example:
{noformat}
CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
('transactional'='true');

INSERT INTO transactions VALUES
(1, 'value_1'),
(2, 'value_2'),
(3, 'value_3'),
(4, 'value_4'),
(5, 'value_5');

CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
INSERT INTO merge_source_1 VALUES 
(1, 'newvalue_1'),
(2, 'newvalue_2'),
(3, 'newvalue_3'),
(6, 'value_6'),
(7, 'value_7');

MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value 
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

CREATE TABLE merge_source_2(
 ID int,
 value string)
STORED AS ORC;

INSERT INTO merge_source_2 VALUES
(1, 'newestvalue_1'),
(2, 'newestvalue_2'),
(5, 'newestvalue_5'),
(7, 'newestvalue_7'),
(8, 'value_18');

MERGE INTO transactions AS T 
USING merge_source_2 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

ALTER TABLE transactions COMPACT 'MAJOR';
{noformat}
The MAJOR compaction will fail with the following error:
{noformat}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
of Acid rows detected for the rows: 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
 and 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
{noformat}
So the validation doesn't check for the correct row order. The correct order is 
originalTransactionId, bucketProperty, rowId.
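
As an illustration of what the corrected validation amounts to, here is a 
minimal plain-Java sketch of a check that enforces the (originalTransactionId, 
bucketProperty, rowId) order; the class below is an illustrative stand-in, not 
the actual GenericUDFValidateAcidSortOrder code:
{noformat}
// Illustrative sketch of the corrected sort-order validation.
final class AcidSortOrderValidator {
    private long prevTxnId = Long.MIN_VALUE;
    private int prevBucket = Integer.MIN_VALUE;
    private long prevRowId = Long.MIN_VALUE;

    // Throws if the row does not come strictly after the previous one in
    // (originalTransactionId, bucketProperty, rowId) order.
    void validate(long txnId, int bucket, long rowId) {
        int cmp = Long.compare(prevTxnId, txnId);
        if (cmp == 0) {
            cmp = Integer.compare(prevBucket, bucket);
        }
        if (cmp == 0) {
            cmp = Long.compare(prevRowId, rowId);
        }
        if (cmp >= 0) {
            throw new IllegalStateException("Wrong sort order of Acid rows: ("
                + txnId + "," + bucket + "," + rowId + ") does not come after ("
                + prevTxnId + "," + prevBucket + "," + prevRowId + ")");
        }
        prevTxnId = txnId;
        prevBucket = bucket;
        prevRowId = rowId;
    }
}
{noformat}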

> Incorrect row order validation for query-based MAJOR compaction
> ---
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> In the insert query of the query-based MAJOR compaction, there is this 
> function call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
> ROW__ID.rowId)".
> This is to validate if the order of the rows is correct. This validation is 
> done by the GenericUDFValidateAcidSortOrder class and it assumes that the 
> rows are in increasing order by bucketProperty, originalTransactionId and 
> rowId. 
> But actually the rows should be ordered by originalTransactionId, 
> bucketProperty and rowId, otherwise the delete deltas cannot be applied 
> correctly. This is also the order that the MR MAJOR compaction writes and in 
> which the split groups are created for the query-based MAJOR compaction. It 
> doesn't cause any issue as long as there is only one bucketProperty in the 
> files, but as soon as there are multiple bucketProperties in the same file, 
> the validation will fail. This can be reproduced by running multiple merge 
> statements after each other.
> For example:
> {noformat}
> CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES 
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> 

[jira] [Assigned] (HIVE-25257) Incorrect row order validation for query-based MAJOR compaction

2021-06-16 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-25257:



> Incorrect row order validation for query-based MAJOR compaction
> ---
>
> Key: HIVE-25257
> URL: https://issues.apache.org/jira/browse/HIVE-25257
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24642) Multiple file listing calls are executed in the MoveTask in case of direct inserts

2021-01-15 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-24642:



> Multiple file listing calls are executed in the MoveTask in case of direct 
> inserts
> --
>
> Key: HIVE-24642
> URL: https://issues.apache.org/jira/browse/HIVE-24642
> Project: Hive
>  Issue Type: Improvement
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Minor
>
> When inserting data into a table with dynamic partitioning and direct insert 
> on, the MoveTask performs several file listings to look up the newly created 
> partitions and files. Check whether all file listings are necessary, or 
> whether the code can be optimized to do fewer listings.
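
One possible direction, sketched in plain Java (illustrative only, not the 
actual MoveTask code): perform a single listing up front and answer the 
per-partition lookups from the cached result.
{noformat}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: cache the result of one recursive listing and answer
// the individual partition/file lookups from it instead of re-listing.
final class CachedListing {
    private final List<String> paths;

    CachedListing(List<String> pathsFromSingleRecursiveListing) {
        this.paths = pathsFromSingleRecursiveListing;
    }

    // No extra file system call per partition; everything is answered from
    // the cached listing.
    List<String> filesUnder(String partitionPath) {
        List<String> result = new ArrayList<>();
        for (String path : paths) {
            if (path.startsWith(partitionPath + "/")) {
                result.add(path);
            }
        }
        return result;
    }
}
{noformat}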



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24642) Multiple file listing calls are executed in the MoveTask in case of direct inserts

2021-01-15 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24642:
-
Affects Version/s: 4.0.0

> Multiple file listing calls are executed in the MoveTask in case of direct 
> inserts
> --
>
> Key: HIVE-24642
> URL: https://issues.apache.org/jira/browse/HIVE-24642
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Minor
>
> When inserting data into a table with dynamic partitioning and direct insert 
> on, the MoveTask performs several file listings to look up the newly created 
> partitions and files. Check whether all file listings are necessary, or 
> whether the code can be optimized to do fewer listings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24581) Remove AcidUtils call from OrcInputformat for non transactional tables

2021-01-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-24581.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

> Remove AcidUtils call from OrcInputformat for non transactional tables
> --
>
> Key: HIVE-24581
> URL: https://issues.apache.org/jira/browse/HIVE-24581
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently the split generation in OrcInputformat is tightly coupled with acid 
> and AcidUtils.getAcidState is called even if the table is not transactional. 
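
The decoupling amounts to a guard along these lines (an illustrative fragment; 
the names are assumptions rather than exact Hive signatures):
{noformat}
// Illustrative fragment only; names are assumptions, not the exact Hive API.
if (isTransactionalTable(table)) {
    // Only transactional tables need the acid directory layout resolved.
    directory = getAcidState(fs, tablePath, conf, writeIdList);
} else {
    // Non-transactional tables can have their files listed directly.
    directory = listRegularFiles(fs, tablePath);
}
{noformat}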



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24581) Remove AcidUtils call from OrcInputformat for non transactional tables

2021-01-11 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17262504#comment-17262504
 ] 

Marta Kuczora commented on HIVE-24581:
--

Pushed to master. Thanks a lot [~pvargacl] for the fix.

> Remove AcidUtils call from OrcInputformat for non transactional tables
> --
>
> Key: HIVE-24581
> URL: https://issues.apache.org/jira/browse/HIVE-24581
> Project: Hive
>  Issue Type: Improvement
>Reporter: Peter Varga
>Assignee: Peter Varga
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Currently the split generation in OrcInputformat is tightly coupled with acid 
> and AcidUtils.getAcidState is called even if the table is not transactional. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24530) Potential NPE in FileSinkOperator.closeRecordwriters method

2020-12-15 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-24530.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Pushed to master.
Thanks a lot [~szita] for the review!

> Potential NPE in FileSinkOperator.closeRecordwriters method
> ---
>
> Key: HIVE-24530
> URL: https://issues.apache.org/jira/browse/HIVE-24530
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> During testing, an NPE occurred in the FileSinkOperator.closeRecordwriters 
> method.
> After investigating, it turned out there was an underlying IOException while 
> executing the FileSinkOperator.process method. It got caught by the following 
> code part:
> {noformat}
> } catch (IOException e) {
>   closeWriters(true);
>   throw new HiveException(e);
> } catch (SerDeException e) {
>   closeWriters(true);
>   throw new HiveException(e);
> }
> {noformat}
> First the closeWriters method was called:
> {noformat}
>   private void closeWriters(boolean abort) throws HiveException {
> fpaths.closeWriters(true);
> closeRecordwriters(true);
>   }
>   private void closeRecordwriters(boolean abort) {
> for (RecordWriter writer : rowOutWriters) {
>   try {
> LOG.info("Closing {} on exception", writer);
> writer.close(abort);
>   } catch (IOException e) {
> LOG.error("Error closing rowOutWriter" + writer, e);
>   }
> }
> {noformat}
> If the writers had been closed successfully, a HiveException would have been 
> thrown with the original IOException.
> But when the IOException occurred, the writers in rowOutWriters were not yet 
> initialised, so an NPE occurred. This was very misleading, as the NPE was not 
> the real issue and the original IOException was hidden.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24322) In case of direct insert, the attempt ID has to be checked when reading the manifest files

2020-12-15 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-24322.
--
Resolution: Fixed

> In case of direct insert, the attempt ID has to be checked when reading the 
> manifest files
> --
>
> Key: HIVE-24322
> URL: https://issues.apache.org/jira/browse/HIVE-24322
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In IMPALA-10247 there was an exception from Hive when trying to load the data:
> {noformat}
> 2020-10-13T16:50:53,424 ERROR [HiveServer2-Background-Pool: Thread-23832] 
> exec.Task: Job Commit failed with exception 
> 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.EOFException)'
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.EOFException
>  at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1468)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:798)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
>  at org.apache.hadoop.hive.ql.exec.tez.TezTask.close(TezTask.java:627)
>  at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:342)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
>  at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:357)
>  at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330)
>  at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
>  at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482)
>  at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:340)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.hadoop.hive.ql.exec.Utilities.handleDirectInsertTableFinalPath(Utilities.java:4587)
>  at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1462)
>  ... 29 more
> {noformat}
> The reason for the exception was that Hive was trying to read an empty 
> manifest file. Manifest files are used in case of direct insert to determine 
> which files need to be kept and which ones need to be cleaned up. They are 
> created by the tasks and use the task attempt ID as a postfix. In this 
> particular test, one of the containers ran out of memory, so Tez decided to 
> kill it right after the manifest file got created but before the paths were 
> written into it. This was the manifest file for task attempt 0. Then Tez 
> assigned a new container to the task, so a new attempt was made with 
> attemptId=1. This one was successful and wrote the manifest file correctly. 
> But Hive didn't know about this: since the out-of-memory issue was handled 
> by Tez under the hood, there was no exception in Hive and therefore no 
> clean-up in the manifest folder. And when Hive reads the manifest files, it 
> simply reads every file from the defined folder, so it tried to read the 
> manifest files for attempts 0 and 1 as well.
> If there are multiple manifest files with the same name but different 
> attemptId, Hive should 
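
Presumably Hive should keep only the manifest written by the highest attempt 
for each task. A minimal Java sketch of such a selection (the name_attemptId 
naming scheme and the helper below are assumptions for illustration, not the 
actual Hive code):
{noformat}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: keep one manifest per task, preferring the highest
// attempt id encoded as the "_<attemptId>" postfix of the file name.
final class ManifestSelector {
    static Map<String, String> latestAttemptOnly(List<String> manifestFileNames) {
        Map<String, String> latest = new HashMap<>();
        Map<String, Integer> bestAttempt = new HashMap<>();
        for (String fileName : manifestFileNames) {
            int sep = fileName.lastIndexOf('_');
            String taskName = fileName.substring(0, sep);
            int attemptId = Integer.parseInt(fileName.substring(sep + 1));
            Integer best = bestAttempt.get(taskName);
            if (best == null || attemptId > best) {
                bestAttempt.put(taskName, attemptId);
                latest.put(taskName, fileName);
            }
        }
        // Only the manifest of the last attempt per task survives.
        return latest;
    }
}
{noformat}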

[jira] [Commented] (HIVE-24322) In case of direct insert, the attempt ID has to be checked when reading the manifest files

2020-12-15 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249672#comment-17249672
 ] 

Marta Kuczora commented on HIVE-24322:
--

Pushed to master.
Thanks a lot [~szita] for the review!

> In case of direct insert, the attempt ID has to be checked when reading the 
> manifest files
> --
>
> Key: HIVE-24322
> URL: https://issues.apache.org/jira/browse/HIVE-24322
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In IMPALA-10247 there was an exception from Hive when trying to load the data:
> {noformat}
> 2020-10-13T16:50:53,424 ERROR [HiveServer2-Background-Pool: Thread-23832] 
> exec.Task: Job Commit failed with exception 
> 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.EOFException)'
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.EOFException
>  at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1468)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:798)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
>  at org.apache.hadoop.hive.ql.exec.tez.TezTask.close(TezTask.java:627)
>  at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:342)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
>  at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:357)
>  at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330)
>  at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
>  at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482)
>  at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:340)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.hadoop.hive.ql.exec.Utilities.handleDirectInsertTableFinalPath(Utilities.java:4587)
>  at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1462)
>  ... 29 more
> {noformat}
> The reason for the exception was that Hive was trying to read an empty 
> manifest file. Manifest files are used in case of direct insert to determine 
> which files need to be kept and which ones need to be cleaned up. They are 
> created by the tasks and use the task attempt ID as a postfix. In this 
> particular test, one of the containers ran out of memory, so Tez decided to 
> kill it right after the manifest file got created but before the paths were 
> written into it. This was the manifest file for task attempt 0. Then Tez 
> assigned a new container to the task, so a new attempt was made with 
> attemptId=1. This one was successful and wrote the manifest file correctly. 
> But Hive didn't know about this: since the out-of-memory issue was handled 
> by Tez under the hood, there was no exception in Hive and therefore no 
> clean-up in the manifest folder. And when Hive reads the manifest files, it 
> simply reads every file from the defined folder, so it tried to read the 
> manifest files for attempts 0 and 1 as well.
> If there are multiple 

[jira] [Updated] (HIVE-24530) Potential NPE in FileSinkOperator.closeRecordwriters method

2020-12-14 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24530:
-
Description: 
During testing, an NPE occurred in the FileSinkOperator.closeRecordwriters 
method.
After investigating, it turned out there was an underlying IOException while 
executing the FileSinkOperator.process method. It got caught by the following 
code part:
{noformat}
} catch (IOException e) {
  closeWriters(true);
  throw new HiveException(e);
} catch (SerDeException e) {
  closeWriters(true);
  throw new HiveException(e);
}
{noformat}
First the closeWriters method was called:
{noformat}
  private void closeWriters(boolean abort) throws HiveException {
fpaths.closeWriters(true);
closeRecordwriters(true);
  }

  private void closeRecordwriters(boolean abort) {
for (RecordWriter writer : rowOutWriters) {
  try {
LOG.info("Closing {} on exception", writer);
writer.close(abort);
  } catch (IOException e) {
LOG.error("Error closing rowOutWriter" + writer, e);
  }
}
{noformat}
If the writers had been closed successfully, a HiveException would have been 
thrown with the original IOException.
But when the IOException occurred, the writers in rowOutWriters were not yet 
initialised, so an NPE occurred. This was very misleading, as the NPE was not 
the real issue and the original IOException was hidden.
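
A minimal sketch of a null-guarded variant of the method above (illustrative 
only; LOG, rowOutWriters and RecordWriter are the surrounding class members 
from the snippet, and this is not necessarily the exact fix that was 
committed):
{noformat}
// Illustrative null-guarded variant; skips closing when nothing was
// initialised yet, so the original exception is not masked by an NPE.
private void closeRecordwriters(boolean abort) {
    if (rowOutWriters == null) {
        return;  // the writers were never initialised, nothing to close
    }
    for (RecordWriter writer : rowOutWriters) {
        if (writer == null) {
            continue;
        }
        try {
            LOG.info("Closing {} on exception", writer);
            writer.close(abort);
        } catch (IOException e) {
            LOG.error("Error closing rowOutWriter" + writer, e);
        }
    }
}
{noformat}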


> Potential NPE in FileSinkOperator.closeRecordwriters method
> ---
>
> Key: HIVE-24530
> URL: https://issues.apache.org/jira/browse/HIVE-24530
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> During testing, an NPE occurred in the FileSinkOperator.closeRecordwriters 
> method.
> After investigating, it turned out there was an underlying IOException while 
> executing the FileSinkOperator.process method. It got caught by the following 
> code part:
> {noformat}
> } catch (IOException e) {
>   closeWriters(true);
>   throw new HiveException(e);
> } catch (SerDeException e) {
>   closeWriters(true);
>   throw new HiveException(e);
> }
> {noformat}
> First the closeWriters method was called:
> {noformat}
>   private void closeWriters(boolean abort) throws HiveException {
> fpaths.closeWriters(true);
> closeRecordwriters(true);
>   }
>   private void closeRecordwriters(boolean abort) {
> for (RecordWriter writer : rowOutWriters) {
>   try {
> LOG.info("Closing {} on exception", writer);
> writer.close(abort);
>   } catch (IOException e) {
> LOG.error("Error closing rowOutWriter" + writer, e);
>   }
> }
> {noformat}
> If the writers had been closed successfully, a HiveException would have been 
> thrown with the original IOException.
> But when the IOException occurred, the writers in rowOutWriters were not yet 
> initialised, so an NPE occurred. This was very misleading, as the NPE was not 
> the real issue and the original IOException was hidden.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24530) Potential NPE in FileSinkOperator.closeRecordwriters method

2020-12-14 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-24530:



> Potential NPE in FileSinkOperator.closeRecordwriters method
> ---
>
> Key: HIVE-24530
> URL: https://issues.apache.org/jira/browse/HIVE-24530
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HIVE-23410) ACID: Improve the delete and update operations to avoid the move step

2020-12-08 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246181#comment-17246181
 ] 

Marta Kuczora edited comment on HIVE-23410 at 12/8/20, 11:09 PM:
-

Pushed to master.
Thanks a lot [~pvary] for the review!!


was (Author: kuczoram):
Pushed to master.
Thanks a lot @pvary for the review!!

> ACID: Improve the delete and update operations to avoid the move step
> -
>
> Key: HIVE-23410
> URL: https://issues.apache.org/jira/browse/HIVE-23410
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23410.1.patch
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> This is a follow-up task for 
> [HIVE-21164|https://issues.apache.org/jira/browse/HIVE-21164], where the 
> insert operation has been modified to write directly to the table locations 
> instead of the staging directory. The same improvement should be done for the 
> ACID update and delete operations as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23410) ACID: Improve the delete and update operations to avoid the move step

2020-12-08 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23410:
-
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> ACID: Improve the delete and update operations to avoid the move step
> -
>
> Key: HIVE-23410
> URL: https://issues.apache.org/jira/browse/HIVE-23410
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23410.1.patch
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> This is a follow-up task for 
> [HIVE-21164|https://issues.apache.org/jira/browse/HIVE-21164], where the 
> insert operation has been modified to write directly to the table locations 
> instead of the staging directory. The same improvement should be done for the 
> ACID update and delete operations as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23410) ACID: Improve the delete and update operations to avoid the move step

2020-12-08 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246181#comment-17246181
 ] 

Marta Kuczora commented on HIVE-23410:
--

Pushed to master.
Thanks a lot @pvary for the review!!

> ACID: Improve the delete and update operations to avoid the move step
> -
>
> Key: HIVE-23410
> URL: https://issues.apache.org/jira/browse/HIVE-23410
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-23410.1.patch
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> This is a follow-up task for 
> [HIVE-21164|https://issues.apache.org/jira/browse/HIVE-21164], where the 
> insert operation has been modified to write directly to the table locations 
> instead of the staging directory. The same improvement should be done for the 
> ACID update and delete operations as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24506) Investigate the materialized_view_create_rewrite_4.q test with direct insert on

2020-12-08 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24506:
-
Description: In the materialized_view_create_rewrite_4.q test, the direct 
insert got turned off because, when it was on, the totalSize of the table 
alternated between two values from run to run. In other test cases this issue 
was due to the order in which the FSOs got the statementIds. Since the direct 
insert is not necessary for materialized views, I turned it off for this test 
in HIVE-23410 and will investigate under this Jira.

> Investigate the materialized_view_create_rewrite_4.q test with direct insert 
> on
> ---
>
> Key: HIVE-24506
> URL: https://issues.apache.org/jira/browse/HIVE-24506
> Project: Hive
>  Issue Type: Task
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> In the materialized_view_create_rewrite_4.q test, the direct insert got 
> turned off because, when it was on, the totalSize of the table alternated 
> between two values from run to run. In other test cases this issue was due to 
> the order in which the FSOs got the statementIds. Since the direct insert is 
> not necessary for materialized views, I turned it off for this test in 
> HIVE-23410 and will investigate under this Jira.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24506) Investigate the materialized_view_create_rewrite_4.q test with direct insert on

2020-12-08 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-24506:


Assignee: Marta Kuczora

> Investigate the materialized_view_create_rewrite_4.q test with direct insert 
> on
> ---
>
> Key: HIVE-24506
> URL: https://issues.apache.org/jira/browse/HIVE-24506
> Project: Hive
>  Issue Type: Task
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24506) Investigate the materialized_view_create_rewrite_4.q test with direct insert on

2020-12-08 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24506:
-
Affects Version/s: 4.0.0

> Investigate the materialized_view_create_rewrite_4.q test with direct insert 
> on
> ---
>
> Key: HIVE-24506
> URL: https://issues.apache.org/jira/browse/HIVE-24506
> Project: Hive
>  Issue Type: Task
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24505) Investigate if the arrays in the FileSinkOperator could be replaced by Lists

2020-12-08 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-24505:



> Investigate if the arrays in the FileSinkOperator could be replaced by Lists
> 
>
> Key: HIVE-24505
> URL: https://issues.apache.org/jira/browse/HIVE-24505
> Project: Hive
>  Issue Type: Task
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>
> The FileSinkOperator uses some array variables, like
> Path[] outPaths;
> Path[] outPathsCommitted;
> Path[] finalPaths;
> RecordWriter[] outWriters;
> RecordUpdater[] updaters;
> Working with these is not always convenient, for example when they are 
> extended with new elements in the createDynamicBucket method, or in case of 
> an UPDATE operation with direct insert on, where the delete deltas have to be 
> collected separately because the outPaths array contains only the inserted 
> deltas. These operations would be much easier with lists.
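
A minimal sketch of the proposed direction, using plain Strings instead of the real Path and RecordWriter types so the example stays self-contained; the class and variable names are illustrative, not the actual FileSinkOperator members:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathListSketch {
  public static void main(String[] args) {
    // Before: a fixed-size array has to be copied whenever a new writer path
    // is added, which is what happens when createDynamicBucket extends it.
    String[] finalPaths = new String[2];
    finalPaths = Arrays.copyOf(finalPaths, finalPaths.length + 1); // manual resize
    finalPaths[2] = "delta_0000001_0000001_0000/bucket_00002";

    // After: a List grows on demand, so the inserted deltas and the delete
    // deltas of an UPDATE with direct insert could be collected together.
    List<String> paths = new ArrayList<>();
    paths.add("delta_0000001_0000001_0000/bucket_00002");
    paths.add("delete_delta_0000002_0000002_0000/bucket_00002");
    System.out.println(paths);
  }
}
{code}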



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (HIVE-24336) Turn off the direct insert for EXPLAIN ANALYZE queries

2020-11-10 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-24336 started by Marta Kuczora.

> Turn off the direct insert for EXPLAIN ANALYZE queries
> --
>
> Key: HIVE-24336
> URL: https://issues.apache.org/jira/browse/HIVE-24336
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the 
> new files will be created in the table directory, and they won't be 
> cleaned up when the EXPLAIN query is finished.
> Example: 
> {noformat}
> create table analyze_table (id int) stored as orc 
> tblproperties('transactional'='true');
> explain analyze insert into analyze_table values (1),(2),(3),(4);
> select * from analyze_table;
> 1
> 2
> 3
> 4
> Time taken: 0.1 seconds, Fetched: 4 row(s)
> The result should be empty after the explain command.
> {noformat}
> An EXPLAIN ANALYZE query will execute the actual query and the files will be 
> created within the staging directory, but the MoveTask won't move them to the 
> final location. So when the EXPLAIN ANALYZE query is finished, the staging 
> directory will be deleted, and the table directory will be the same as before 
> the EXPLAIN query. But with direct insert on the files will be written into 
> the table directory, so an additional cleanup would be necessary in order to 
> restore the files within the table directory to the state before the EXPLAIN 
> ANALYZE query. This could be avoided by turning off the direct insert for an 
> EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
> eliminating the file movements within the MoveTask and has no effect on the 
> query execution plan, it can safely be turned off for explain queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24336) Turn off the direct insert for EXPLAIN ANALYZE queries

2020-11-10 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-24336.
--
Resolution: Fixed

> Turn off the direct insert for EXPLAIN ANALYZE queries
> --
>
> Key: HIVE-24336
> URL: https://issues.apache.org/jira/browse/HIVE-24336
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the 
> new files will be created in the table directory, and they won't be 
> cleaned up when the EXPLAIN query is finished.
> Example: 
> {noformat}
> create table analyze_table (id int) stored as orc 
> tblproperties('transactional'='true');
> explain analyze insert into analyze_table values (1),(2),(3),(4);
> select * from analyze_table;
> 1
> 2
> 3
> 4
> Time taken: 0.1 seconds, Fetched: 4 row(s)
> The result should be empty after the explain command.
> {noformat}
> An EXPLAIN ANALYZE query will execute the actual query and the files will be 
> created within the staging directory, but the MoveTask won't move them to the 
> final location. So when the EXPLAIN ANALYZE query is finished, the staging 
> directory will be deleted, and the table directory will be the same as before 
> the EXPLAIN query. But with direct insert on the files will be written into 
> the table directory, so an additional cleanup would be necessary in order to 
> restore the files within the table directory to the state before the EXPLAIN 
> ANALYZE query. This could be avoided by turning off the direct insert for an 
> EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
> eliminating the file movements within the MoveTask and has no effect on the 
> query execution plan, it can safely be turned off for explain queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24336) Turn off the direct insert for EXPLAIN ANALYZE queries

2020-11-10 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229456#comment-17229456
 ] 

Marta Kuczora commented on HIVE-24336:
--

Pushed to master. Thanks a lot [~szita] for the review!

> Turn off the direct insert for EXPLAIN ANALYZE queries
> --
>
> Key: HIVE-24336
> URL: https://issues.apache.org/jira/browse/HIVE-24336
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the 
> new files will be created in the table directory, and they won't be 
> cleaned up when the EXPLAIN query is finished.
> Example: 
> {noformat}
> create table analyze_table (id int) stored as orc 
> tblproperties('transactional'='true');
> explain analyze insert into analyze_table values (1),(2),(3),(4);
> select * from analyze_table;
> 1
> 2
> 3
> 4
> Time taken: 0.1 seconds, Fetched: 4 row(s)
> The result should be empty after the explain command.
> {noformat}
> An EXPLAIN ANALYZE query will execute the actual query and the files will be 
> created within the staging directory, but the MoveTask won't move them to the 
> final location. So when the EXPLAIN ANALYZE query is finished, the staging 
> directory will be deleted, and the table directory will be the same as before 
> the EXPLAIN query. But with direct insert on the files will be written into 
> the table directory, so an additional cleanup would be necessary in order to 
> restore the files within the table directory to the state before the EXPLAIN 
> ANALYZE query. This could be avoided by turning off the direct insert for an 
> EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
> eliminating the file movements within the MoveTask and has no effect on the 
> query execution plan, it can safely be turned off for explain queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24336) Turn off the direct insert for EXPLAIN ANALYZE queries

2020-10-30 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24336:
-
Description: 
If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the new 
files will be created in the table directory, and they won't be cleaned up when 
the EXPLAIN query is finished.

Example: 
{noformat}
create table analyze_table (id int) stored as orc 
tblproperties('transactional'='true');
explain analyze insert into analyze_table values (1),(2),(3),(4);

select * from analyze_table;
1
2
3
4
Time taken: 0.1 seconds, Fetched: 4 row(s)

The result should be empty after the explain command.
{noformat}

An EXPLAIN ANALYZE query will execute the actual query and the files will be 
created within the staging directory, but the MoveTask won't move them to the 
final location. So when the EXPLAIN ANALYZE query is finished, the staging 
directory will be deleted, and the table directory will be the same as before 
the EXPLAIN query. But with direct insert on the files will be written into the 
table directory, so an additional cleanup would be necessary in order to 
restore the files within the table directory to the state before the EXPLAIN 
ANALYZE query. This could be avoided by turning off the direct insert for an 
EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
eliminating the file movements within the MoveTask and has no effect on the 
query execution plan, it can safely be turned off for explain queries.
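
A minimal sketch of the guard described above, assuming the decision can be reduced to a single boolean during planning; the class and method names are hypothetical, not the actual Hive API:

{code:java}
// Hypothetical sketch: force direct insert off while planning an EXPLAIN
// ANALYZE, so every file is written under the staging directory and is
// deleted together with it when the explain finishes.
public class DirectInsertGuard {
  static boolean useDirectInsert(boolean directInsertEnabled, boolean isExplainAnalyze) {
    // Direct insert only skips the file moves of the MoveTask; it does not
    // change the query plan, so disabling it for EXPLAIN ANALYZE is safe.
    return directInsertEnabled && !isExplainAnalyze;
  }

  public static void main(String[] args) {
    System.out.println(useDirectInsert(true, true));  // false: EXPLAIN ANALYZE run
    System.out.println(useDirectInsert(true, false)); // true: normal INSERT
  }
}
{code}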

  was:
If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the new 
files will be created in the table directory, and they won't be cleaned up when 
the EXPLAIN query is finished.

Example:

 

An EXPLAIN ANALYZE query will execute the actual query and the files will be 
created within the staging directory, but the MoveTask won't move them to the 
final location. So when the EXPLAIN ANALYZE query is finished, the staging 
directory will be deleted, and the table directory will be the same as before 
the EXPLAIN query. But with direct insert on the files will be written into the 
table directory, so an additional cleanup would be necessary in order to 
restore the files within the table directory to the state before the EXPLAIN 
ANALYZE query. This could be avoided by turning off the direct insert for an 
EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
eliminating the file movements within the MoveTask and has no effect on the 
query execution plan, it can safely be turned off for explain queries.


> Turn off the direct insert for EXPLAIN ANALYZE queries
> --
>
> Key: HIVE-24336
> URL: https://issues.apache.org/jira/browse/HIVE-24336
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the 
> new files will be created in the table directory, and they won't be 
> cleaned up when the EXPLAIN query is finished.
> Example: 
> {noformat}
> create table analyze_table (id int) stored as orc 
> tblproperties('transactional'='true');
> explain analyze insert into analyze_table values (1),(2),(3),(4);
> select * from analyze_table;
> 1
> 2
> 3
> 4
> Time taken: 0.1 seconds, Fetched: 4 row(s)
> The result should be empty after the explain command.
> {noformat}
> An EXPLAIN ANALYZE query will execute the actual query and the files will be 
> created within the staging directory, but the MoveTask won't move them to the 
> final location. So when the EXPLAIN ANALYZE query is finished, the staging 
> directory will be deleted, and the table directory will be the same as before 
> the EXPLAIN query. But with direct insert on the files will be written into 
> the table directory, so an additional cleanup would be necessary in order to 
> restore the files within the table directory to the state before the EXPLAIN 
> ANALYZE query. This could be avoided by turning off the direct insert for an 
> EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
> eliminating the file movements within the MoveTask and has no effect on the 
> query execution plan, it can safely be turned off for explain queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24336) Turn off the direct insert for EXPLAIN ANALYZE queries

2020-10-30 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24336:
-
Description: 
If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the new 
files will be created in the table directory, and they won't be cleaned up when 
the EXPLAIN query is finished.

Example:

 

An EXPLAIN ANALYZE query will execute the actual query and the files will be 
created within the staging directory, but the MoveTask won't move them to the 
final location. So when the EXPLAIN ANALYZE query is finished, the staging 
directory will be deleted, and the table directory will be the same as before 
the EXPLAIN query. But with direct insert on the files will be written into the 
table directory, so an additional cleanup would be necessary in order to 
restore the files within the table directory to the state before the EXPLAIN 
ANALYZE query. This could be avoided by turning off the direct insert for an 
EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
eliminating the file movements within the MoveTask and has no effect on the 
query execution plan, it can safely be turned off for explain queries.

> Turn off the direct insert for EXPLAIN ANALYZE queries
> --
>
> Key: HIVE-24336
> URL: https://issues.apache.org/jira/browse/HIVE-24336
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> If we do an EXPLAIN ANALYZE for an INSERT query with direct insert on, the 
> new files will be created in the table directory, and they won't be 
> cleaned up when the EXPLAIN query is finished.
> Example:
>  
> An EXPLAIN ANALYZE query will execute the actual query and the files will be 
> created within the staging directory, but the MoveTask won't move them to the 
> final location. So when the EXPLAIN ANALYZE query is finished, the staging 
> directory will be deleted, and the table directory will be the same as before 
> the EXPLAIN query. But with direct insert on the files will be written into 
> the table directory, so an additional cleanup would be necessary in order to 
> restore the files within the table directory to the state before the EXPLAIN 
> ANALYZE query. This could be avoided by turning off the direct insert for an 
> EXPLAIN ANALYZE query. Since the direct insert only improves performance by 
> eliminating the file movements within the MoveTask and has no effect on the 
> query execution plan, it can safely be turned off for explain queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24336) Turn off the direct insert for EXPLAIN ANALYZE queries

2020-10-30 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-24336:



> Turn off the direct insert for EXPLAIN ANALYZE queries
> --
>
> Key: HIVE-24336
> URL: https://issues.apache.org/jira/browse/HIVE-24336
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24322) In case of direct insert, the attempt ID has to be checked when reading the manifest files

2020-10-28 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24322:
-
Description: 
In IMPALA-10247 there was an exception from Hive when trying to load the data:
{noformat}
2020-10-13T16:50:53,424 ERROR [HiveServer2-Background-Pool: Thread-23832] 
exec.Task: Job Commit failed with exception 
'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.EOFException)'
org.apache.hadoop.hive.ql.metadata.HiveException: java.io.EOFException
 at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1468)
 at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:798)
 at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
 at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
 at org.apache.hadoop.hive.ql.exec.tez.TezTask.close(TezTask.java:627)
 at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:342)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
 at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
 at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:357)
 at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330)
 at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
 at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482)
 at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166)
 at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
 at 
org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
 at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:322)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
 at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:340)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at 
org.apache.hadoop.hive.ql.exec.Utilities.handleDirectInsertTableFinalPath(Utilities.java:4587)
 at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1462)
 ... 29 more
{noformat}
The reason for the exception was that Hive was trying to read an empty manifest 
file. Manifest files are used in case of direct insert to determine which files 
need to be kept and which ones need to be cleaned up. They are created by the 
tasks and use the task attempt Id as a postfix. In this particular test, one of 
the containers ran out of memory, so Tez decided to kill it right after the 
manifest file got created but before the paths got written into it. This was 
the manifest file for task attempt 0. Then Tez assigned a new container to the 
task, so a new attempt was made with attemptId=1. This one was successful and 
wrote its manifest file correctly. But Hive didn't know about this: the 
out-of-memory issue was handled by Tez under the hood, so there was no 
exception in Hive and therefore no clean-up in the manifest folder. And when 
Hive reads the manifest files, it simply reads every file from the defined 
folder, so it tried to read the manifest files for both attempt 0 and attempt 1.
If there are multiple manifest files with the same name but different 
attemptIds, Hive should only read the one with the highest attempt Id.
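
A minimal sketch of the selection rule described above; the manifest file naming (a base name plus a numeric attempt-ID postfix) is illustrative and not the exact format Hive writes:

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: when several manifest files share a base name and
// differ only in the attempt-ID postfix, keep only the highest attempt.
public class ManifestPicker {
  static Map<String, String> pickLatestAttempt(List<String> manifestFiles) {
    Map<String, String> latest = new HashMap<>();
    Map<String, Integer> bestAttempt = new HashMap<>();
    for (String file : manifestFiles) {
      int dot = file.lastIndexOf('.');
      String base = file.substring(0, dot);
      int attempt = Integer.parseInt(file.substring(dot + 1));
      if (attempt >= bestAttempt.getOrDefault(base, -1)) {
        bestAttempt.put(base, attempt);
        latest.put(base, file);
      }
    }
    return latest;
  }

  public static void main(String[] args) {
    // Attempt 0 was killed before writing its paths; attempt 1 succeeded,
    // so only the attempt-1 manifest should be read.
    System.out.println(pickLatestAttempt(
        List.of("000000_0.manifest.0", "000000_0.manifest.1")));
  }
}
{code}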

  was:
In [IMPALA-10247|https://issues.apache.org/jira/browse/IMPALA-10247] there was 
an exception from Hive when trying to load the data:
{noformat}
2020-10-13T16:50:53,424 ERROR [HiveServer2-Background-Pool: Thread-23832] 
exec.Task: Job Commit failed with exception 
'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.EOFException)'
org.apache.hadoop.hive.ql.metadata.HiveException: java.io.EOFException
 at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1468)
 at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:798)
 at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
 at 

[jira] [Assigned] (HIVE-24322) In case of direct insert, the attempt ID has to be checked when reading the manifest files

2020-10-28 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-24322:



> In case of direct insert, the attempt ID has to be checked when reading the 
> manifest files
> --
>
> Key: HIVE-24322
> URL: https://issues.apache.org/jira/browse/HIVE-24322
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> In [IMPALA-10247|https://issues.apache.org/jira/browse/IMPALA-10247] there 
> was an exception from Hive when trying to load the data:
> {noformat}
> 2020-10-13T16:50:53,424 ERROR [HiveServer2-Background-Pool: Thread-23832] 
> exec.Task: Job Commit failed with exception 
> 'org.apache.hadoop.hive.ql.metadata.HiveException(java.io.EOFException)'
> org.apache.hadoop.hive.ql.metadata.HiveException: java.io.EOFException
>  at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1468)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:798)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
>  at org.apache.hadoop.hive.ql.exec.Operator.jobClose(Operator.java:803)
>  at org.apache.hadoop.hive.ql.exec.tez.TezTask.close(TezTask.java:627)
>  at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:342)
>  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
>  at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
>  at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:357)
>  at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330)
>  at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
>  at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109)
>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488)
>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482)
>  at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:322)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>  at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:340)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.EOFException
>  at java.io.DataInputStream.readInt(DataInputStream.java:392)
>  at 
> org.apache.hadoop.hive.ql.exec.Utilities.handleDirectInsertTableFinalPath(Utilities.java:4587)
>  at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.jobCloseOp(FileSinkOperator.java:1462)
>  ... 29 more
> {noformat}
> The reason for the exception was that Hive was trying to read an empty 
> manifest file. Manifest files are used in case of direct insert to determine 
> which files need to be kept and which ones need to be cleaned up. They are 
> created by the tasks and use the task attempt Id as a postfix. In this 
> particular test, one of the containers ran out of memory, so Tez decided to 
> kill it right after the manifest file got created but before the paths got 
> written into it. This was the manifest file for task attempt 0. Then Tez 
> assigned a new container to the task, so a new attempt was made with 
> attemptId=1. This one was successful and wrote its manifest file correctly. 
> But Hive didn't know about this: the out-of-memory issue was handled by Tez 
> under the hood, so there was no exception in Hive and therefore no clean-up 
> in the manifest folder. And when Hive reads the manifest files, it simply 
> reads every file from the defined folder, so it tried to read the manifest 
> files for both attempt 0 and attempt 1.
> If there are multiple manifest files with the same name but different 
> attemptIds, Hive should only read the one with the highest attempt Id.



--
This 

[jira] [Updated] (HIVE-24213) Incorrect exception in the Merge MapJoinTask into its child MapRedTask optimizer

2020-10-02 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24213:
-
Fix Version/s: 4.0.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to master.

Thanks a lot [~zmatyus] for the patch and [~kgyrtkirk] for the review!

> Incorrect exception in the Merge MapJoinTask into its child MapRedTask 
> optimizer
> 
>
> Key: HIVE-24213
> URL: https://issues.apache.org/jira/browse/HIVE-24213
> Project: Hive
>  Issue Type: Bug
>  Components: Physical Optimizer
>Affects Versions: 4.0.0
>Reporter: Zoltan Matyus
>Assignee: Zoltan Matyus
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{CommonJoinTaskDispatcher#mergeMapJoinTaskIntoItsChildMapRedTask}} 
> method throws a {{SemanticException}} if the number of {{FileSinkOperator}}s 
> it finds is not exactly 1. The exception is valid if zero operators are 
> found, but there can be valid use cases where multiple FileSinkOperators 
> exist.
> Example: the MapJoin and its child are used in a common table expression, 
> which is used for multiple inserts.
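
A minimal sketch of the corrected validation, reduced to plain collections; the surrounding dispatcher logic is omitted:

{code:java}
import java.util.List;

// Hypothetical sketch: zero FileSinkOperators is still an error, but more
// than one (e.g. a CTE feeding a multi-insert) is a valid plan and must not
// throw anymore.
public class FileSinkCountCheck {
  static void validate(List<String> fileSinkOperatorNames) {
    if (fileSinkOperatorNames.isEmpty()) {
      throw new IllegalStateException("expected at least one FileSinkOperator");
    }
    // Previously the check rejected every plan where the count was not
    // exactly 1, which broke valid multi-insert use cases.
  }

  public static void main(String[] args) {
    validate(List.of("FS_1", "FS_2")); // accepted after the fix
  }
}
{code}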



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-24163) Dynamic Partitioning Insert for MM table fails during MoveTask

2020-09-22 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-24163.
--
Fix Version/s: (was: 3.1.2)
   4.0.0
   Resolution: Fixed

> Dynamic Partitioning Insert for MM table fails during MoveTask
> --
>
> Key: HIVE-24163
> URL: https://issues.apache.org/jira/browse/HIVE-24163
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Rajkumar Singh
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> -- DDLs and Query
> {code:java}
> create table `class` (name varchar(8), sex varchar(1), age double precision, 
> height double precision, weight double precision);
> insert into table class values ('RAJ','MALE',28,12,12);
> CREATE TABLE `PART1` (`id` DOUBLE,`N` DOUBLE,`Name` VARCHAR(8),`Sex` 
> VARCHAR(1)) PARTITIONED BY(Weight string, Age
> string, Height string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
> LINES TERMINATED BY '\012' STORED AS TEXTFILE;
> INSERT INTO TABLE `part1` PARTITION (`Weight`,`Age`,`Height`)  SELECT 0, 0, 
> `Name`,`Sex`,`Weight`,`Age`,`Height` FROM `class`;
> {code}
> it fails during the MoveTask execution:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: partition 
> hdfs://hostname:8020/warehouse/tablespace/managed/hive/part1/.hive-staging_hive_2020-09-02_13-29-58_765_4475282758764123921-1/-ext-1/tmpstats-0_FS_3
>  is not a directory!
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2769)
>  ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:2837) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleDynParts(MoveTask.java:562) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:440) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> {code}
> The reason is that the task writes the fsstat file while the FileSinkOperator 
> is closing; when HS2 then runs the MoveTask to move the data into the 
> destination partition directory, Hive checks whether each candidate partition 
> location is a directory, finds the stats file instead, and fails.
> -- Hive sets the stats location at 
> https://github.com/apache/hive/blob/d700ea54ec5da5364d92a9faaa58f89ea03181e0/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8135
> which is relative to the hive-staging directory:
> https://github.com/apache/hive/blob/fecad5b0f72c535ed1c53f2cc62b0d6649b651ae/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L617



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24163) Dynamic Partitioning Insert for MM table fails during MoveTask

2020-09-22 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200018#comment-17200018
 ] 

Marta Kuczora commented on HIVE-24163:
--

Pushed to master.

Thanks a lot for the review [~pvary]

> Dynamic Partitioning Insert for MM table fails during MoveTask
> --
>
> Key: HIVE-24163
> URL: https://issues.apache.org/jira/browse/HIVE-24163
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Rajkumar Singh
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.1.2
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> -- DDLs and Query
> {code:java}
> create table `class` (name varchar(8), sex varchar(1), age double precision, 
> height double precision, weight double precision);
> insert into table class values ('RAJ','MALE',28,12,12);
> CREATE TABLE `PART1` (`id` DOUBLE,`N` DOUBLE,`Name` VARCHAR(8),`Sex` 
> VARCHAR(1)) PARTITIONED BY(Weight string, Age
> string, Height string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
> LINES TERMINATED BY '\012' STORED AS TEXTFILE;
> INSERT INTO TABLE `part1` PARTITION (`Weight`,`Age`,`Height`)  SELECT 0, 0, 
> `Name`,`Sex`,`Weight`,`Age`,`Height` FROM `class`;
> {code}
> it fails during the MoveTask execution:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: partition 
> hdfs://hostname:8020/warehouse/tablespace/managed/hive/part1/.hive-staging_hive_2020-09-02_13-29-58_765_4475282758764123921-1/-ext-1/tmpstats-0_FS_3
>  is not a directory!
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2769)
>  ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:2837) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleDynParts(MoveTask.java:562) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:440) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> {code}
> The reason is that the task writes the fsstat file while the FileSinkOperator 
> is closing; when HS2 then runs the MoveTask to move the data into the 
> destination partition directory, Hive checks whether each candidate partition 
> location is a directory, finds the stats file instead, and fails.
> -- Hive sets the stats location at 
> https://github.com/apache/hive/blob/d700ea54ec5da5364d92a9faaa58f89ea03181e0/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8135
> which is relative to the hive-staging directory:
> https://github.com/apache/hive/blob/fecad5b0f72c535ed1c53f2cc62b0d6649b651ae/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L617



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24163) Dynamic Partitioning Insert for MM table fails during MoveTask

2020-09-22 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-24163:
-
Summary: Dynamic Partitioning Insert for MM table fails during MoveTask  
(was: Dynamic Partitioning Insert fail for MM table fail during MoveTask)

> Dynamic Partitioning Insert for MM table fails during MoveTask
> --
>
> Key: HIVE-24163
> URL: https://issues.apache.org/jira/browse/HIVE-24163
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Rajkumar Singh
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.1.2
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> -- DDLs and Query
> {code:java}
> create table `class` (name varchar(8), sex varchar(1), age double precision, 
> height double precision, weight double precision);
> insert into table class values ('RAJ','MALE',28,12,12);
> CREATE TABLE `PART1` (`id` DOUBLE,`N` DOUBLE,`Name` VARCHAR(8),`Sex` 
> VARCHAR(1)) PARTITIONED BY(Weight string, Age
> string, Height string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
> LINES TERMINATED BY '\012' STORED AS TEXTFILE;
> INSERT INTO TABLE `part1` PARTITION (`Weight`,`Age`,`Height`)  SELECT 0, 0, 
> `Name`,`Sex`,`Weight`,`Age`,`Height` FROM `class`;
> {code}
> it fails during the MoveTask execution:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: partition 
> hdfs://hostname:8020/warehouse/tablespace/managed/hive/part1/.hive-staging_hive_2020-09-02_13-29-58_765_4475282758764123921-1/-ext-1/tmpstats-0_FS_3
>  is not a directory!
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2769)
>  ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:2837) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleDynParts(MoveTask.java:562) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:440) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> {code}
> The reason is that the task writes the fsstat file while the FileSinkOperator 
> is closing; when HS2 then runs the MoveTask to move the data into the 
> destination partition directory, Hive checks whether each candidate partition 
> location is a directory, finds the stats file instead, and fails.
> -- Hive sets the stats location at 
> https://github.com/apache/hive/blob/d700ea54ec5da5364d92a9faaa58f89ea03181e0/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8135
> which is relative to the hive-staging directory:
> https://github.com/apache/hive/blob/fecad5b0f72c535ed1c53f2cc62b0d6649b651ae/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L617



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24163) Dynamic Partitioning Insert fail for MM table fail during MoveTask

2020-09-21 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199477#comment-17199477
 ] 

Marta Kuczora commented on HIVE-24163:
--

There was a typo in the direct insert path, but it turned out that there are 
more issues. The file listing in the Utilities.getFullDPSpecs method was not 
correct for MM tables, and for ACID tables when direct insert was on: the 
method returned all partitions of these tables, not just the ones affected by 
the current query. Because of this, the lineage information for dynamic 
partitioning inserts into such tables was incorrect. Comparing the lineage 
information with inserts into external tables showed that for external tables 
only the partitions affected by the query are present. This is because for 
external tables the data is first written into the staging directory, and when 
the partitions are listed, this directory is checked and it contains only the 
newly inserted data. But for MM tables and ACID direct insert there is no 
staging directory, so the table directory is checked and everything in it gets 
listed. A sketch of the difference is given below.
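
A minimal sketch of the difference described above; the directory layout and the write-ID matching are illustrative, not the real Utilities.getFullDPSpecs logic:

{code:java}
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: without a staging directory, a naive listing of the
// table directory returns every partition. Restricting the listing to
// directories written by the current query (here approximated by the write
// ID in the delta directory name) yields only the affected partitions.
public class DynamicPartitionListing {
  static List<String> partitionsWrittenBy(List<String> tableDirs, String writeId) {
    return tableDirs.stream()
        .filter(dir -> dir.contains("delta_" + writeId))
        .map(dir -> dir.substring(0, dir.indexOf("/delta_")))
        .distinct()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> dirs = List.of(
        "weight=12/age=28/height=12/delta_0000002_0000002_0000",
        "weight=70/age=30/height=180/delta_0000001_0000001_0000");
    // Only the partition written by write ID 0000002_0000002 is returned.
    System.out.println(partitionsWrittenBy(dirs, "0000002_0000002"));
  }
}
{code}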

> Dynamic Partitioning Insert fail for MM table fail during MoveTask
> --
>
> Key: HIVE-24163
> URL: https://issues.apache.org/jira/browse/HIVE-24163
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Rajkumar Singh
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.1.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> -- DDLs and Query
> {code:java}
> create table `class` (name varchar(8), sex varchar(1), age double precision, 
> height double precision, weight double precision);
> insert into table class values ('RAJ','MALE',28,12,12);
> CREATE TABLE `PART1` (`id` DOUBLE,`N` DOUBLE,`Name` VARCHAR(8),`Sex` 
> VARCHAR(1)) PARTITIONED BY(Weight string, Age
> string, Height string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
> LINES TERMINATED BY '\012' STORED AS TEXTFILE;
> INSERT INTO TABLE `part1` PARTITION (`Weight`,`Age`,`Height`)  SELECT 0, 0, 
> `Name`,`Sex`,`Weight`,`Age`,`Height` FROM `class`;
> {code}
> it fails during the MoveTask execution:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: partition 
> hdfs://hostname:8020/warehouse/tablespace/managed/hive/part1/.hive-staging_hive_2020-09-02_13-29-58_765_4475282758764123921-1/-ext-1/tmpstats-0_FS_3
>  is not a directory!
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2769)
>  ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:2837) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleDynParts(MoveTask.java:562) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:440) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> {code}
> The reason is that the task writes the fsstat file while the FileSinkOperator 
> is closing; HS2 then ran the MoveTask to move data into the 

[jira] [Work started] (HIVE-24163) Dynamic Partitioning Insert fail for MM table fail during MoveTask

2020-09-18 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-24163 started by Marta Kuczora.

> Dynamic Partitioning Insert fail for MM table fail during MoveTask
> --
>
> Key: HIVE-24163
> URL: https://issues.apache.org/jira/browse/HIVE-24163
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Rajkumar Singh
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 3.1.2
>
>
> -- DDLs and Query
> {code:java}
> create table `class` (name varchar(8), sex varchar(1), age double precision, 
> height double precision, weight double precision);
> insert into table class values ('RAJ','MALE',28,12,12);
> CREATE TABLE `PART1` (`id` DOUBLE,`N` DOUBLE,`Name` VARCHAR(8),`Sex` 
> VARCHAR(1)) PARTITIONED BY(Weight string, Age
> string, Height string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
> LINES TERMINATED BY '\012' STORED AS TEXTFILE;
> INSERT INTO TABLE `part1` PARTITION (`Weight`,`Age`,`Height`)  SELECT 0, 0, 
> `Name`,`Sex`,`Weight`,`Age`,`Height` FROM `class`;
> {code}
> it fails during the MoveTask execution:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: partition 
> hdfs://hostname:8020/warehouse/tablespace/managed/hive/part1/.hive-staging_hive_2020-09-02_13-29-58_765_4475282758764123921-1/-ext-1/tmpstats-0_FS_3
>  is not a directory!
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2769)
>  ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:2837) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleDynParts(MoveTask.java:562) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:440) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> {code}
> The reason is that the task writes the fsstat file while the FileSinkOperator 
> is closing; when HS2 then runs the MoveTask to move the data into the 
> destination partition directory, Hive checks whether each candidate partition 
> location is a directory, finds the stats file instead, and fails.
> -- Hive sets the stats location at 
> https://github.com/apache/hive/blob/d700ea54ec5da5364d92a9faaa58f89ea03181e0/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8135
> which is relative to the hive-staging directory:
> https://github.com/apache/hive/blob/fecad5b0f72c535ed1c53f2cc62b0d6649b651ae/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L617



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24163) Dynamic Partitioning Insert fail for MM table fail during MoveTask

2020-09-18 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-24163:


Assignee: Marta Kuczora

> Dynamic Partitioning Insert fail for MM table fail during MoveTask
> --
>
> Key: HIVE-24163
> URL: https://issues.apache.org/jira/browse/HIVE-24163
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: Rajkumar Singh
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 3.1.2
>
>
> -- DDLs and Query
> {code:java}
> create table `class` (name varchar(8), sex varchar(1), age double precision, 
> height double precision, weight double precision);
> insert into table class values ('RAJ','MALE',28,12,12);
> CREATE TABLE `PART1` (`id` DOUBLE,`N` DOUBLE,`Name` VARCHAR(8),`Sex` 
> VARCHAR(1)) PARTITIONED BY(Weight string, Age
> string, Height string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' 
> LINES TERMINATED BY '\012' STORED AS TEXTFILE;
> INSERT INTO TABLE `part1` PARTITION (`Weight`,`Age`,`Height`)  SELECT 0, 0, 
> `Name`,`Sex`,`Weight`,`Age`,`Height` FROM `class`;
> {code}
> it fails during the MoveTask execution:
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: partition 
> hdfs://hostname:8020/warehouse/tablespace/managed/hive/part1/.hive-staging_hive_2020-09-02_13-29-58_765_4475282758764123921-1/-ext-1/tmpstats-0_FS_3
>  is not a directory!
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.getValidPartitionsInPath(Hive.java:2769)
>  ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions(Hive.java:2837) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleDynParts(MoveTask.java:562) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:440) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.2.0.0-237.jar:3.1.3000.7.2.0.0-237]
> {code}
> The reason is that the task writes the fsstat file while the FileSinkOperator 
> is closing; when HS2 then runs the MoveTask to move the data into the 
> destination partition directory, Hive checks whether each candidate partition 
> location is a directory, finds the stats file instead, and fails.
> -- Hive sets the stats location at 
> https://github.com/apache/hive/blob/d700ea54ec5da5364d92a9faaa58f89ea03181e0/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L8135
> which is relative to the hive-staging directory:
> https://github.com/apache/hive/blob/fecad5b0f72c535ed1c53f2cc62b0d6649b651ae/ql/src/java/org/apache/hadoop/hive/ql/Context.java#L617



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24023) Hive parquet reader can't read files with length=0

2020-08-25 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183843#comment-17183843
 ] 

Marta Kuczora commented on HIVE-24023:
--

+1

Thanks a lot [~klcopp] for the patch.

> Hive parquet reader can't read files with length=0
> --
>
> Key: HIVE-24023
> URL: https://issues.apache.org/jira/browse/HIVE-24023
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Impala truncates insert-only parquet tables by creating a base directory 
> containing a completely empty file.
> Hive throws an exception upon reading when it looks for metadata:
> {code:java}
> Error: java.io.IOException: java.lang.RuntimeException:  is not a 
> Parquet file (too small length: 0) (state=,code=0){code}
> We can introduce a check for an empty file before Hive tries to read the 
> metadata.
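
A minimal sketch of such a check; in Hive the length would come from a FileStatus on the distributed filesystem, and java.io.File is used here only to keep the example runnable:

{code:java}
import java.io.File;
import java.io.IOException;

public class EmptyParquetCheck {
  // An Impala truncate of an insert-only table leaves a completely empty
  // file; reading its footer fails with "too small length: 0", so skip it.
  static boolean shouldReadFooter(File file) {
    return file.length() > 0;
  }

  public static void main(String[] args) throws IOException {
    File empty = File.createTempFile("bucket_00000", ".parq");
    empty.deleteOnExit();
    System.out.println(shouldReadFooter(empty)); // false: nothing to read
  }
}
{code}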



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-04 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora resolved HIVE-23763.
--
Resolution: Fixed

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Delete some rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> - After the compaction the newly created delete delta contains only one bucket 
> file. This file contains rows from all buckets, and the table becomes unusable:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   createNewPaths(null, lbDirName);
> }
>   } else {
> if (conf.isCompactionTable()) {
>   int bucketProperty = getBucketProperty(row);
>   bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
> }
> createBucketFiles(fsp);
>   }
> }
> {noformat}
> When the first row is processed, the file is created and then the 
> filesCreated variable is set to true. When the subsequent rows are processed, 
> the outer if condition is false, so no new file gets created; each row is 
> written into the file created for the first row, regardless of its bucket Id.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-04 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170715#comment-17170715
 ] 

Marta Kuczora commented on HIVE-23763:
--

Pushed to master. Thanks a lot [~pvary] for the review!

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Delete some rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> - After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   createNewPaths(null, lbDirName);
> }
>   } else {
> if (conf.isCompactionTable()) {
>   int bucketProperty = getBucketProperty(row);
>   bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
> }
> createBucketFiles(fsp);
>   }
> }
> {noformat}
> When the first row is processed, the file is created and then the 
> filesCreated variable is set to true. Then when the other rows are processed, 
> the first if statement will be false, so no new file gets created, but the 
> row will be written into the file created for the first row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-08-04 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170714#comment-17170714
 ] 

Marta Kuczora commented on HIVE-23763:
--

Some details about the fix:

The insert query run by the MINOR compaction uses reducers, not just mappers 
like the query of the MAJOR compaction. Also, the temp table the query inserts 
into is clustered by bucket number and sorted by bucket number, original 
transaction and row Id. Because of this, even though the split groups are 
created correctly per bucket, the rows are not always distributed correctly 
between the reducers. It can happen that one reducer, and therefore one 
FileSinkOperator, gets rows from multiple buckets and writes them into one 
file, which results in a corrupted file. We cannot always prevent rows from 
multiple buckets ending up in one FileSinkOperator, for example when the 
reducer count is smaller than the table's bucket count. So in this patch the 
FileSinkOperator was extended to handle rows from multiple buckets, similar to 
what the createDynamicBuckets method does for delete deltas.
The other part of the patch makes sure that rows from the same bucket always 
get to the same FileSinkOperator. Therefore the ReduceSinkOperator was extended 
to use the bucket number when distributing the rows.
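
As a rough illustration of the two pieces (the writer-per-bucket map and the 
helper names below are assumptions; only BucketCodec is the real Hive API):

{code:java}
// Hedged sketch, not the actual FileSinkOperator/ReduceSinkOperator change.
Map<Integer, Writer> writersByBucket = new HashMap<>();

void processRow(Object row) throws IOException {
  // Decode the bucket id from the row's bucketProperty, as in the
  // FileSinkOperator snippet quoted in the description.
  int bucketProperty = getBucketProperty(row);
  int bucketId = BucketCodec.determineVersion(bucketProperty)
      .decodeWriterId(bucketProperty);
  // Keep one output file per decoded bucket id, so rows from several
  // buckets can pass through one operator without being mixed together.
  Writer writer = writersByBucket.get(bucketId);
  if (writer == null) {
    writer = createWriterForBucket(bucketId); // assumed helper
    writersByBucket.put(bucketId, writer);
  }
  writer.write(row);
}
{code}

On the reducer side the same decoded bucket id would be used as part of the 
distribution key, so all rows of a bucket land in the same reducer whenever 
the reducer count allows it.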

 

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Delete some rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> - After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   createNewPaths(null, lbDirName);
> }
>   } else {
> if (conf.isCompactionTable()) {
>   int bucketProperty = getBucketProperty(row);
>   bucketId = 
> BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
> }
> createBucketFiles(fsp);
>   }
> }
> {noformat}
> When the first row is processed, the file is created and then the 
> filesCreated variable is set to true. Then when the other rows are processed, 
> the first if statement will be false, so no new file gets created, but the 
> row will be written into the file created for the first row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-06-25 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Pushed to master!

Thanks a lot [~pvary] for the review.

> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-23444.1.patch, HIVE-23444.1.patch, 
> HIVE-23444.1.patch, HIVE-23444.1.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The following exception may occur when concurrently inserting into an ACID 
> table with static partitions and the 'hive.acid.direct.insert.enabled' 
> parameter is true. This issue occurs intermittently.
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   ... 13 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2448) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> 

[jira] [Updated] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-06-25 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23763:
-
Description: 
How to reproduce:

- Create an unbucketed ACID table
- Insert a bigger amount of data into this table so there would be multiple 
bucket files in the table
The files in the table should look like this:
/warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
/warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
/warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
/warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
/warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
/warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
- Delete some rows with different bucket Ids
The files in a delete delta should look like this:
/warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
/warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
/warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
- Run the query-based minor compaction
- After the compaction the newly created delete delta contains only 1 bucket 
file. This file contains rows from all buckets and the table becomes unusable
/warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0

The issue happens only if rows with different bucket Ids are processed by the 
same FileSinkOperator. 
In the FileSinkOperator.process method, the files for the compaction table are 
created like this:
{noformat}
if (!bDynParts && !filesCreated) {
  if (lbDirName != null) {
if (valToPaths.get(lbDirName) == null) {
  createNewPaths(null, lbDirName);
}
  } else {
if (conf.isCompactionTable()) {
  int bucketProperty = getBucketProperty(row);
  bucketId = 
BucketCodec.determineVersion(bucketProperty).decodeWriterId(bucketProperty);
}
createBucketFiles(fsp);
  }
}
{noformat}
When the first row is processed, the file is created and then the filesCreated 
variable is set to true. Then when the other rows are processed, the first if 
statement will be false, so no new file gets created, but the row will be 
written into the file created for the first row.

> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> How to reproduce:
> - Create an unbucketed ACID table
> - Insert a bigger amount of data into this table so there would be multiple 
> bucket files in the table
> The files in the table should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_0_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_1_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_2_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_3_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_4_0
> /warehouse/tablespace/managed/hive/bubu_acid/delta_001_001_/bucket_5_0
> - Delete some rows with different bucket Ids
> The files in a delete delta should look like this:
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_002_002_/bucket_0
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_3
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_006_006_/bucket_1
> - Run the query-based minor compaction
> - After the compaction the newly created delete delta contains only 1 bucket 
> file. This file contains rows from all buckets and the table becomes unusable
> /warehouse/tablespace/managed/hive/bubu_acid/delete_delta_001_007_v066/bucket_0
> The issue happens only if rows with different bucket Ids are processed by the 
> same FileSinkOperator. 
> In the FileSinkOperator.process method, the files for the compaction table 
> are created like this:
> {noformat}
> if (!bDynParts && !filesCreated) {
>   if (lbDirName != null) {
> if (valToPaths.get(lbDirName) == null) {
>   

[jira] [Assigned] (HIVE-23763) Query based minor compaction produces wrong files when rows with different buckets Ids are processed by the same FileSinkOperator

2020-06-25 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-23763:



> Query based minor compaction produces wrong files when rows with different 
> buckets Ids are processed by the same FileSinkOperator
> -
>
> Key: HIVE-23763
> URL: https://issues.apache.org/jira/browse/HIVE-23763
> Project: Hive
>  Issue Type: Bug
>  Components: Transactions
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23703) Major QB compaction with multiple FileSinkOperators results in data loss and one original file

2020-06-22 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141858#comment-17141858
 ] 

Marta Kuczora commented on HIVE-23703:
--

+1
Thanks a lot [~klcopp] for the patch!

> Major QB compaction with multiple FileSinkOperators results in data loss and 
> one original file
> --
>
> Key: HIVE-23703
> URL: https://issues.apache.org/jira/browse/HIVE-23703
> Project: Hive
>  Issue Type: Bug
>Reporter: Karen Coppage
>Assignee: Karen Coppage
>Priority: Critical
>  Labels: compaction, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> h4. Problems
> Example:
> {code:java}
> drop table if exists tbl2;
> create transactional table tbl2 (a int, b int) clustered by (a) into 4 
> buckets stored as ORC 
> TBLPROPERTIES('transactional'='true','transactional_properties'='default');
> insert into tbl2 values(1,2),(1,3),(1,4),(2,2),(2,3),(2,4);
> insert into tbl2 values(3,2),(3,3),(3,4),(4,2),(4,3),(4,4);
> insert into tbl2 values(5,2),(5,3),(5,4),(6,2),(6,3),(6,4);{code}
> E.g. in the example above, bucketId=0 when a=2 and a=6.
> 1. Data loss 
>  In non-acid tables, an operator's temp files are named with their task id. 
> Because of this snippet, temp files in the FileSinkOperator for compaction 
> tables are identified by their bucket_id.
> {code:java}
> if (conf.isCompactionTable()) {
>  fsp.initializeBucketPaths(filesIdx, AcidUtils.BUCKET_PREFIX + 
> String.format(AcidUtils.BUCKET_DIGITS, bucketId),
>  isNativeTable(), isSkewedStoredAsSubDirectories);
>  } else {
>  fsp.initializeBucketPaths(filesIdx, taskId, isNativeTable(), 
> isSkewedStoredAsSubDirectories);
>  }
> {code}
> So 2 temp files containing data with a=2 and a=6 will be named bucket_0 and 
> not 00_0 and 00_1 as they would normally.
>  In FileSinkOperator.commit, when the file holding a=2 data (named bucket_0) 
> is moved from _task_tmp.-ext-10002 to _tmp.-ext-10002, it overwrites the file 
> already there holding a=6 data, because that file is also named bucket_0. You 
> can see it in the logs:
> {code:java}
>  WARN [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: Target 
> path 
> file:.../hive/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnNoBuckets-1591107230237/warehouse/testmajorcompaction/base_002_v013/.hive-staging_hive_2020-06-02_07-15-21_771_8551447285061957908-1/_tmp.-ext-10002/bucket_0
>  with a size 610 exists. Trying to delete it.
> {code}
> 2. Results in one original file
>  OrcFileMergeOperator merges the results of the FSOp into 1 file named 
> 00_0.
> h4. Fix
> 1. FSOp will store data as: taskid/bucketId. e.g. 0_0/bucket_0
> 2. OrcMergeFileOp, instead of merging a bunch of files into 1 file named 
> 00_0, will merge all files named bucket_0 into one file named bucket_0, 
> and so on.
> 3. MoveTask will get rid of the taskId directories if present and only move 
> the bucket files in them, in case OrcMergeFileOp is not run.
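
To make the three steps above concrete, a schematic of the layout change 
(paths are illustrative only, reusing the example's bucket_0):

{noformat}
Before the fix (two tasks collide on one temp name):
  _tmp.-ext-10002/bucket_0        <- task 0 (a=2) and task 1 (a=6) both write here

After the fix (per-task subdirectory, then a per-bucket merge):
  _tmp.-ext-10002/0_0/bucket_0    <- task 0
  _tmp.-ext-10002/1_0/bucket_0    <- task 1
  merge step: all files named bucket_0 -> one file named bucket_0
{noformat}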



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-14 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Attachment: HIVE-23444.1.patch

> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23444.1.patch, HIVE-23444.1.patch, 
> HIVE-23444.1.patch, HIVE-23444.1.patch
>
>
> The following exception may occur when concurrently inserting into an ACID 
> table with static partitions and the 'hive.acid.direct.insert.enabled' 
> parameter is true. This issue occurs intermittently.
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   ... 13 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2448) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> 

[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-13 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Attachment: HIVE-23444.1.patch

> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23444.1.patch, HIVE-23444.1.patch, 
> HIVE-23444.1.patch
>
>
> The following exception may occur when concurrently inserting into an ACID 
> table with static partitions and the 'hive.acid.direct.insert.enabled' 
> parameter is true. This issue occurs intermittently.
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   ... 13 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2448) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 

[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-13 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Attachment: HIVE-23444.1.patch

> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23444.1.patch, HIVE-23444.1.patch
>
>
> The following exception may occur when concurrently inserting into an ACID 
> table with static partitions and the 'hive.acid.direct.insert.enabled' 
> parameter is true. This issue occurs intermittently.
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   ... 13 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2448) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 

[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Status: Patch Available  (was: Open)

> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23444.1.patch
>
>
> The following exception may occur when concurrently inserting into an ACID 
> table with static partitions and the 'hive.acid.direct.insert.enabled' 
> parameter is true. This issue occurs intermittently.
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   ... 13 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2448) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 

[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Attachment: HIVE-23444.1.patch

> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23444.1.patch
>
>
> The following exception may occur when concurrently inserting into an ACID 
> table with static partitions and the 'hive.acid.direct.insert.enabled' 
> parameter is true. This issue occurs intermittently.
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   ... 13 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2448) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 

[jira] [Commented] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-12 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105351#comment-17105351
 ] 

Marta Kuczora commented on HIVE-23442:
--

Pushed to master.
Thanks a lot [~pvary] for the review!

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23442.1.patch, HIVE-23442.2.patch
>
>
> Steps to reproduce:
> {noformat}
> SET hive.acid.direct.insert.enabled=true;
> CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
> INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);
> CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
> TBLPROPERTIES('transactional'='true');
> INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;
> UPDATE test_comp SET b=55, c=66 WHERE a=2;
> DELETE FROM test_comp WHERE a=4;
> UPDATE test_comp SET b=77 WHERE a=1;
> SELECT * FROM test_comp;
> 3 3   3
> 2 55  66
> 1 77  1
> ALTER TABLE test_comp COMPACT 'MAJOR';
> SELECT * FROM test_comp;
> 2 55  66
> 1 77  1
> {noformat}
> This issue only occurs if the base directory was created with an insert 
> overwrite command and the hive.acid.direct.insert.enabled parameter was true. 
> This issue doesn't affect the query based compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23442.1.patch, HIVE-23442.2.patch
>
>
> Steps to reproduce:
> {noformat}
> SET hive.acid.direct.insert.enabled=true;
> CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
> INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);
> CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
> TBLPROPERTIES('transactional'='true');
> INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;
> UPDATE test_comp SET b=55, c=66 WHERE a=2;
> DELETE FROM test_comp WHERE a=4;
> UPDATE test_comp SET b=77 WHERE a=1;
> SELECT * FROM test_comp;
> 3 3   3
> 2 55  66
> 1 77  1
> ALTER TABLE test_comp COMPACT 'MAJOR';
> SELECT * FROM test_comp;
> 2 55  66
> 1 77  1
> {noformat}
> This issue only occurs if the base directory was created with an insert 
> overwrite command and the hive.acid.direct.insert.enabled parameter was true. 
> This issue doesn't affect the query based compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-12 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Attachment: HIVE-23442.2.patch

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23442.1.patch, HIVE-23442.2.patch
>
>
> Steps to reproduce:
> {noformat}
> SET hive.acid.direct.insert.enabled=true;
> CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
> INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);
> CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
> TBLPROPERTIES('transactional'='true');
> INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;
> UPDATE test_comp SET b=55, c=66 WHERE a=2;
> DELETE FROM test_comp WHERE a=4;
> UPDATE test_comp SET b=77 WHERE a=1;
> SELECT * FROM test_comp;
> 3 3   3
> 2 55  66
> 1 77  1
> ALTER TABLE test_comp COMPACT 'MAJOR';
> SELECT * FROM test_comp;
> 2 55  66
> 1 77  1
> {noformat}
> This issue only occurs if the base directory was created with an insert 
> overwrite command and the hive.acid.direct.insert.enabled parameter was true. 
> This issue doesn't affect the query based compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-11 Thread Marta Kuczora (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104875#comment-17104875
 ] 

Marta Kuczora commented on HIVE-23444:
--

The missing directory (_tmp.delta_001_001_) is the manifest 
directory which is written when inserting into an ACID table with 
hive.acid.direct.insert.enabled=true. 
The exception occurs in the AcidUtils.getHdfsDirSnapshots method when trying to 
list the newly written files from the partition directory. In case of static 
partitions, the manifest directory is located in the partition folder. If 
inserts are happening concurrently, it can happen that one thread has already 
written the manifest file but has not yet deleted it. Then another thread calls 
the AcidUtils.getHdfsDirSnapshots method, which lists all the files and 
directories from the partition folder, including the manifest directory. The 
first thread then deletes the manifest file after the listing, but before the 
second thread iterates over the files and directories, so the iterator throws a 
FileNotFoundException when it tries to access the already-deleted manifest 
directory.
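
A rough sketch of where the race bites (illustrative, not the actual AcidUtils 
code; assumes fs and partitionDir are in scope):

{code:java}
// Illustrative only; the real logic is in AcidUtils.getHdfsDirSnapshots.
// fs.listFiles returns a lazy RemoteIterator: entries are fetched while
// iterating, so a manifest deleted after the call but before iteration
// surfaces as a FileNotFoundException from the iterator.
RemoteIterator<LocatedFileStatus> it = fs.listFiles(partitionDir, true);
while (it.hasNext()) {
  LocatedFileStatus file;
  try {
    file = it.next();
  } catch (FileNotFoundException e) {
    // A concurrent insert removed its _tmp.delta_* manifest between the
    // listing and this step; skipping such entries is one possible fix.
    continue;
  }
  // ... add 'file' to the directory snapshot ...
}
{code}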

> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> The following exception may occur when concurrently inserting into an ACID 
> table with static partitions and the 'hive.acid.direct.insert.enabled' 
> parameter is true. This issue occurs intermittently.
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   ... 13 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
>  ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> 

[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Description: 
The following exception may occur when concurrently inserting into an ACID 
table with static partitions and the 'hive.acid.direct.insert.enabled' 
parameter is true. This issue occurs intermittently.

{noformat}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: java.io.FileNotFoundException: File 
hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
 does not exist.
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
 ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
... 13 more
Caused by: java.io.IOException: java.io.FileNotFoundException: File 
hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
 does not exist.
at 
org.apache.hadoop.hive.ql.io.AcidUtils.getHdfsDirSnapshots(AcidUtils.java:1472) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:1297) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.io.AcidUtils.getAcidFilesForStats(AcidUtils.java:2695)
 ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2448) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 

[jira] [Updated] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23444:
-
Description: 
The following exception may occur when concurrently inserting into an ACID 
table with static partitions while the 'hive.acid.direct.insert.enabled' 
parameter is set to true. The issue occurs intermittently.

{noformat}
2020-04-30 15:56:54,706 ERROR org.apache.hive.service.cli.operation.Operation: 
[HiveServer2-Background-Pool: Thread-675]: Error running hive query: 
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: 
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: 
java.io.FileNotFoundException: File 
hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
 does not exist.
at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:362)
 ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:241)
 ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
 ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:322)
 [hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at java.security.AccessController.doPrivileged(Native Method) [?:?]
at javax.security.auth.Subject.doAs(Subject.java:423) [?:?]
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
 [hadoop-common-3.1.1.7.1.1.0-493.jar:?]
at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:340)
 [hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: java.io.FileNotFoundException: File 
hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
 does not exist.
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:166) 
~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:225)
 ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
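
Judging from the last frames above, the failure mode looks like a classic
list-then-visit race on HDFS: a _tmp.delta_* directory returned by one
directory listing is renamed or removed by a concurrent writer before the
snapshot code descends into it. Below is a minimal, hypothetical Java sketch
of a listing that tolerates entries vanishing mid-scan; it is illustrative
only, not Hive's AcidUtils code, and the class, method, and path names in it
are made up.

{noformat}
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TolerantDirSnapshot {

  // Recursively collect files, skipping directories that disappear between
  // being listed by the caller and being visited here.
  public static List<FileStatus> snapshot(FileSystem fs, Path dir) throws IOException {
    List<FileStatus> result = new ArrayList<>();
    FileStatus[] children;
    try {
      children = fs.listStatus(dir);
    } catch (FileNotFoundException e) {
      // A concurrent transaction finalized (renamed) or cleaned up this
      // directory after it was listed; treat it as empty instead of failing.
      return result;
    }
    for (FileStatus child : children) {
      if (child.isDirectory()) {
        result.addAll(snapshot(fs, child.getPath()));
      } else {
        result.add(child);
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical partition path, mirroring the one in the stack trace.
    for (FileStatus f : snapshot(fs, new Path("/warehouse/.../l_tax=0.0"))) {
      System.out.println(f.getPath());
    }
  }
}
{noformat}

The window between listStatus() and the recursive visit is small, which is
consistent with the intermittent nature of the failure.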

[jira] [Assigned] (HIVE-23444) Concurrent ACID direct inserts may fail with FileNotFoundException

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-23444:



> Concurrent ACID direct inserts may fail with FileNotFoundException
> --
>
> Key: HIVE-23444
> URL: https://issues.apache.org/jira/browse/HIVE-23444
> Project: Hive
>  Issue Type: Bug
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> {noformat}
> 2020-04-30 15:56:54,706 ERROR 
> org.apache.hive.service.cli.operation.Operation: 
> [HiveServer2-Background-Pool: Thread-675]: Error running hive query: 
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: 
> java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:362)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:241)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
>  ~[hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:322)
>  [hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at java.security.AccessController.doPrivileged(Native Method) [?:?]
>   at javax.security.auth.Subject.doAs(Subject.java:423) [?:?]
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>  [hadoop-common-3.1.1.7.1.1.0-493.jar:?]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:340)
>  [hive-service-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
>   at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>   at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: java.io.FileNotFoundException: File 
> hdfs://ns1/warehouse/tablespace/managed/hive/tpch_unbucketed.db/concurrent_insert_partitioned/l_tax=0.0/_tmp.delta_001_001_
>  does not exist.
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:2465) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2228) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.MoveTask.handleStaticParts(MoveTask.java:522) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:442) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:359) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:330) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:109) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:721) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:488) 
> ~[hive-exec-3.1.3000.7.1.1.0-493.jar:3.1.3000.7.1.1.0-493]
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:482) 
> 
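
Because the failure is timing-dependent, reproducing it typically takes many
sessions inserting into the same static partition at once. A rough harness
along those lines is sketched below; the JDBC URL, the table layout, and the
inserted values are placeholders loosely modelled on the path in the trace,
not taken from the report.

{noformat}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentDirectInsertRepro {

  // Placeholder HiveServer2 endpoint and database.
  private static final String URL = "jdbc:hive2://localhost:10000/tpch_unbucketed";

  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int i = 0; i < 8; i++) {
      final int id = i;
      pool.submit(() -> {
        try (Connection c = DriverManager.getConnection(URL);
             Statement s = c.createStatement()) {
          s.execute("SET hive.acid.direct.insert.enabled=true");
          // Every session targets the same static partition; the column
          // list of the table is assumed, not known from the report.
          s.execute("INSERT INTO concurrent_insert_partitioned "
              + "PARTITION (l_tax=0.0) VALUES (" + id + ")");
        } catch (Exception e) {
          // Intermittently: FileNotFoundException on a _tmp.delta_* path.
          e.printStackTrace();
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }
}
{noformat}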

[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Status: Patch Available  (was: Open)

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23442.1.patch
>
>
> Steps to reproduce:
> {noformat}
> SET hive.acid.direct.insert.enabled=true;
> CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
> INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);
> CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
> TBLPROPERTIES('transactional'='true');
> INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;
> UPDATE test_comp SET b=55, c=66 WHERE a=2;
> DELETE FROM test_comp WHERE a=4;
> UPDATE test_comp SET b=77 WHERE a=1;
> SELECT * FROM test_comp;
> 3 3   3
> 2 55  66
> 1 77  1
> ALTER TABLE test_comp COMPACT 'MAJOR';
> SELECT * FROM test_comp;
> 2 55  66
> 1 77  1
> {noformat}
> This issue only occurs if the base directory was created by an insert 
> overwrite command while the hive.acid.direct.insert.enabled parameter was 
> true. It does not affect query-based compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
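
The wrong result above is consistent with the compaction merging the
update/delete deltas while skipping the base directory that the INSERT
OVERWRITE wrote via direct insert: rows rewritten by an UPDATE (a=1 and a=2)
live on in their deltas, while the untouched row (a=3) exists only in the
base and disappears. A toy model of the base selection a reader/compactor
has to perform is sketched below; the directory names are illustrative and
this is not Hive's AcidUtils implementation.

{noformat}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BaseSelection {

  // Accepts names like base_0000001 or base_0000001_v0000009.
  private static final Pattern BASE = Pattern.compile("base_(\\d+).*");

  static long baseWriteId(String dir) {
    Matcher m = BASE.matcher(dir);
    return m.matches() ? Long.parseLong(m.group(1)) : -1L;
  }

  // The newest recognized base; everything below its write id is obsolete,
  // everything above it must come from deltas.
  static Optional<String> pickBase(List<String> dirs) {
    return dirs.stream()
        .filter(d -> baseWriteId(d) >= 0)
        .max(Comparator.comparingLong(BaseSelection::baseWriteId));
  }

  public static void main(String[] args) {
    // Hypothetical layout after the reproduction steps: INSERT OVERWRITE
    // (direct insert), two UPDATEs and one DELETE.
    List<String> dirs = Arrays.asList(
        "base_0000001",                       // the four original rows
        "delta_0000002_0000002_0000",         // UPDATE ... WHERE a=2
        "delete_delta_0000002_0000002_0000",
        "delete_delta_0000003_0000003_0000",  // DELETE ... WHERE a=4
        "delta_0000004_0000004_0000");        // UPDATE ... WHERE a=1
    // If the base is not recognized here, only rows present in some delta
    // survive the rewrite, which is exactly the wrong result above.
    System.out.println("read base: " + pickBase(dirs).orElse("<none found>"));
  }
}
{noformat}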


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Attachment: HIVE-23442.1.patch

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-23442.1.patch
>
>
> Steps to reproduce:
> {noformat}
> SET hive.acid.direct.insert.enabled=true;
> CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
> INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);
> CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
> TBLPROPERTIES('transactional'='true');
> INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;
> UPDATE test_comp SET b=55, c=66 WHERE a=2;
> DELETE FROM test_comp WHERE a=4;
> UPDATE test_comp SET b=77 WHERE a=1;
> SELECT * FROM test_comp;
> 3 3   3
> 2 55  66
> 1 77  1
> ALTER TABLE test_comp COMPACT 'MAJOR';
> SELECT * FROM test_comp;
> 2 55  66
> 1 77  1
> {noformat}
> This issue only occurs if the base directory was created by an insert 
> overwrite command while the hive.acid.direct.insert.enabled parameter was 
> true. It does not affect query-based compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Description: 
Steps to reproduce:
{noformat}
SET hive.acid.direct.insert.enabled=true;

CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);

CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
TBLPROPERTIES('transactional'='true');
INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;

UPDATE test_comp SET b=55, c=66 WHERE a=2;
DELETE FROM test_comp WHERE a=4;
UPDATE test_comp SET b=77 WHERE a=1;

SELECT * FROM test_comp;
3   3   3
2   55  66
1   77  1

ALTER TABLE test_comp COMPACT 'MAJOR';

SELECT * FROM test_comp;
2   55  66
1   77  1
{noformat}

This issue only occurs if the base directory was created by an insert 
overwrite command while the hive.acid.direct.insert.enabled parameter was 
true. It does not affect query-based compaction.

  was:
Steps to reproduce:
{noformat}
SET hive.acid.direct.insert.enabled=true;

CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);

CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
TBLPROPERTIES('transactional'='true');
INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;

UPDATE test_comp SET b=55, c=66 WHERE a=2;
DELETE FROM test_comp WHERE a=4;
UPDATE test_comp SET b=77 WHERE a=1;

SELECT * FROM test_comp;
3   3   3
2   55  66
1   77  1

ALTER TABLE test_comp COMPACT 'MAJOR';

SELECT * FROM test_comp;
2   55  66
1   77  1
{noformat}

This issue only occurs if the base directory was created with an insert 
overwrite command and the hive.acid.direct.insert.enabled was true. This issue 
doesn't affect the query based compaction.


> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> Steps to reproduce:
> {noformat}
> SET hive.acid.direct.insert.enabled=true;
> CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
> INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);
> CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
> TBLPROPERTIES('transactional'='true');
> INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;
> UPDATE test_comp SET b=55, c=66 WHERE a=2;
> DELETE FROM test_comp WHERE a=4;
> UPDATE test_comp SET b=77 WHERE a=1;
> SELECT * FROM test_comp;
> 3 3   3
> 2 55  66
> 1 77  1
> ALTER TABLE test_comp COMPACT 'MAJOR';
> SELECT * FROM test_comp;
> 2 55  66
> 1 77  1
> {noformat}
> This issue only occurs if the base directory was created by an insert 
> overwrite command while the hive.acid.direct.insert.enabled parameter was 
> true. It does not affect query-based compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Description: 
Steps to reproduce:
{noformat}
SET hive.acid.direct.insert.enabled=true;

CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);

CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
TBLPROPERTIES('transactional'='true');
INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;

UPDATE test_comp SET b=55, c=66 WHERE a=2;
DELETE FROM test_comp WHERE a=4;
UPDATE test_comp SET b=77 WHERE a=1;

SELECT * FROM test_comp;
3   3   3
2   55  66
1   77  1

ALTER TABLE test_comp COMPACT 'MAJOR';

SELECT * FROM test_comp;
2   55  66
1   77  1
{noformat}

This issue only occurs if the base directory was created with an insert 
overwrite command and the hive.acid.direct.insert.enabled was true. This issue 
doesn't affect the query based compaction.

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>
> Steps to reproduce:
> {noformat}
> SET hive.acid.direct.insert.enabled=true;
> CREATE EXTERNAL TABLE test_comp_txt(a int, b int, c int) STORED AS TEXTFILE;
> INSERT INTO test_comp_txt values (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4);
> CREATE TABLE test_comp(a int, b int, c int) STORED AS ORC 
> TBLPROPERTIES('transactional'='true');
> INSERT OVERWRITE TABLE test_comp SELECT * FROM test_comp_txt;
> UPDATE test_comp SET b=55, c=66 WHERE a=2;
> DELETE FROM test_comp WHERE a=4;
> UPDATE test_comp SET b=77 WHERE a=1;
> SELECT * FROM test_comp;
> 3 3   3
> 2 55  66
> 1 77  1
> ALTER TABLE test_comp COMPACT 'MAJOR';
> SELECT * FROM test_comp;
> 2 55  66
> 1 77  1
> {noformat}
> This issue only occurs if the base directory was created with an insert 
> overwrite command and the hive.acid.direct.insert.enabled was true. This 
> issue doesn't affect the query based compaction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correct if it was written by insert overwrite by direct insert

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Summary: ACID major compaction doesn't read base directory correct if it 
was written by insert overwrite by direct insert  (was: ACID major compaction 
doesn't read base correct if it was written by insert overwrite by direct 
insert)

> ACID major compaction doesn't read base directory correct if it was written 
> by insert overwrite by direct insert
> 
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Summary: ACID major compaction doesn't read base directory correctly if it 
was written by insert overwrite  (was: ACID major compaction doesn't read base 
directory correctly if it was written by insert overwrite by direct insert)

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite
> -
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-23442) ACID major compaction doesn't read base directory correctly if it was written by insert overwrite by direct insert

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-23442:
-
Summary: ACID major compaction doesn't read base directory correctly if it 
was written by insert overwrite by direct insert  (was: ACID major compaction 
doesn't read base directory correct if it was written by insert overwrite by 
direct insert)

> ACID major compaction doesn't read base directory correctly if it was written 
> by insert overwrite by direct insert
> --
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-23442) ACID major compaction doesn't read base correct if it was written by insert overwrite by direct insert

2020-05-11 Thread Marta Kuczora (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora reassigned HIVE-23442:


Fix Version/s: 4.0.0
Affects Version/s: 4.0.0
 Assignee: Marta Kuczora

> ACID major compaction doesn't read base correct if it was written by insert 
> overwrite by direct insert
> --
>
> Key: HIVE-23442
> URL: https://issues.apache.org/jira/browse/HIVE-23442
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Marta Kuczora
>Assignee: Marta Kuczora
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

