[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=587765&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-587765
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 23/Apr/21 10:26
Start Date: 23/Apr/21 10:26
Worklog Time Spent: 10m 
  Work Description: pvary merged pull request #2161:
URL: https://github.com/apache/hive/pull/2161


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 587765)
Time Spent: 3h 50m  (was: 3h 40m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586533&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586533
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 21/Apr/21 12:34
Start Date: 21/Apr/21 12:34
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617491058



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -346,25 +372,23 @@ private static ExecutorService tableExecutor(Configuration conf, int maxThreadNu
 
   /**
    * Get the committed data files for this table and job.
+   *
+   * @param numTasks Number of writer tasks that produced a forCommit file
    * @param executor The executor used for reading the forCommit files parallel
    * @param location The location of the table
    * @param jobContext The job context
    * @param io The FileIO used for reading a files generated for commit
    * @param throwOnFailure If true then it throws an exception on failure
    * @return The list of the committed data files
    */
-  private static Collection dataFiles(ExecutorService executor, String location, JobContext jobContext,
-                                      FileIO io, boolean throwOnFailure) {
+  private static Collection dataFiles(int numTasks, ExecutorService executor, String location,
+                                      JobContext jobContext, FileIO io, boolean throwOnFailure) {

Review comment:
   Fun stuff 
   Whatever!






Issue Time Tracking
---

Worklog Id: (was: 586533)
Time Spent: 3h 40m  (was: 3.5h)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586532&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586532
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 21/Apr/21 12:33
Start Date: 21/Apr/21 12:33
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617490443



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -239,6 +252,16 @@ public void abortJob(JobContext originalContext, int status) throws IOException
     cleanup(jobContext, jobLocations);
   }
 
+  private Set listForCommits(JobConf jobConf, String jobLocation) throws IOException {

Review comment:
   Right, good idea






Issue Time Tracking
---

Worklog Id: (was: 586532)
Time Spent: 3.5h  (was: 3h 20m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586529&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586529
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 21/Apr/21 12:33
Start Date: 21/Apr/21 12:33
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617490157



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -346,25 +372,23 @@ private static ExecutorService tableExecutor(Configuration conf, int maxThreadNu
 
   /**
    * Get the committed data files for this table and job.
+   *
+   * @param numTasks Number of writer tasks that produced a forCommit file
    * @param executor The executor used for reading the forCommit files parallel
    * @param location The location of the table
    * @param jobContext The job context
    * @param io The FileIO used for reading a files generated for commit
    * @param throwOnFailure If true then it throws an exception on failure
    * @return The list of the committed data files
    */
-  private static Collection dataFiles(ExecutorService executor, String location, JobContext jobContext,
-                                      FileIO io, boolean throwOnFailure) {
+  private static Collection dataFiles(int numTasks, ExecutorService executor, String location,
+                                      JobContext jobContext, FileIO io, boolean throwOnFailure) {

Review comment:
   The Iceberg reviewers are a bit confusing about this. Four-space padding is indeed the 
general rule for line continuation, but Anton asked me on earlier PRs not to use it 
for method parameter indentation and to do it this way instead.






Issue Time Tracking
---

Worklog Id: (was: 586529)
Time Spent: 3h 20m  (was: 3h 10m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586527&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586527
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 21/Apr/21 12:29
Start Date: 21/Apr/21 12:29
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617487622



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -239,6 +252,16 @@ public void abortJob(JobContext originalContext, int status) throws IOException
     cleanup(jobContext, jobLocations);
   }
 
+  private Set listForCommits(JobConf jobConf, String jobLocation) throws IOException {

Review comment:
   nit: Maybe add a javadoc noting that this should not be used for anything other than abort.






Issue Time Tracking
---

Worklog Id: (was: 586527)
Time Spent: 3h 10m  (was: 3h)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586526&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586526
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 21/Apr/21 12:28
Start Date: 21/Apr/21 12:28
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617486952



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -346,25 +372,23 @@ private static ExecutorService tableExecutor(Configuration conf, int maxThreadNu
 
   /**
    * Get the committed data files for this table and job.
+   *
+   * @param numTasks Number of writer tasks that produced a forCommit file
    * @param executor The executor used for reading the forCommit files parallel
    * @param location The location of the table
    * @param jobContext The job context
    * @param io The FileIO used for reading a files generated for commit
    * @param throwOnFailure If true then it throws an exception on failure
    * @return The list of the committed data files
    */
-  private static Collection dataFiles(ExecutorService executor, String location, JobContext jobContext,
-                                      FileIO io, boolean throwOnFailure) {
+  private static Collection dataFiles(int numTasks, ExecutorService executor, String location,
+                                      JobContext jobContext, FileIO io, boolean throwOnFailure) {

Review comment:
   nit: I have seen Iceberg reviewers asking for 4 space padding.






Issue Time Tracking
---

Worklog Id: (was: 586526)
Time Spent: 3h  (was: 2h 50m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586377&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586377
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 21/Apr/21 07:33
Start Date: 21/Apr/21 07:33
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617272290



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
           .executeWith(tableExecutor)
           .run(output -> {
             Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-            HiveIcebergRecordWriter writer = writers.get(output);
-            DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-            String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                attemptID.getJobID(), attemptID.getTaskID().getId());
-
-            // Creating the file containing the data files generated by this task for this table
-            createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+            if (table != null) {

Review comment:
   Yeah, I think you're right






Issue Time Tracking
---

Worklog Id: (was: 586377)
Time Spent: 2h 50m  (was: 2h 40m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586332&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586332
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 21/Apr/21 05:17
Start Date: 21/Apr/21 05:17
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617206293



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
           .executeWith(tableExecutor)
           .run(output -> {
             Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-            HiveIcebergRecordWriter writer = writers.get(output);
-            DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-            String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                attemptID.getJobID(), attemptID.getTaskID().getId());
-
-            // Creating the file containing the data files generated by this task for this table
-            createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+            if (table != null) {

Review comment:
   I think this is the same situation we have been in with some other Tez 
patches. It does not hurt to have it there, and we keep the two codebases from diverging. 
But it's up to you to decide. 






Issue Time Tracking
---

Worklog Id: (was: 586332)
Time Spent: 2h 40m  (was: 2.5h)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=586208&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-586208
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 21:24
Start Date: 20/Apr/21 21:24
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r617040976



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
           .executeWith(tableExecutor)
           .run(output -> {
             Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-            HiveIcebergRecordWriter writer = writers.get(output);
-            DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-            String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                attemptID.getJobID(), attemptID.getTaskID().getId());
-
-            // Creating the file containing the data files generated by this task for this table
-            createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+            if (table != null) {

Review comment:
   Actually, this issue does not occur with MR, only with Tez, so it 
currently does not come up in upstream Iceberg. I will fix this once Tez writes 
have been enabled there too.






Issue Time Tracking
---

Worklog Id: (was: 586208)
Time Spent: 2.5h  (was: 2h 20m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585781&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585781
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 12:59
Start Date: 20/Apr/21 12:59
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616658639



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
           .executeWith(tableExecutor)
           .run(output -> {
             Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-            HiveIcebergRecordWriter writer = writers.get(output);
-            DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-            String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                attemptID.getJobID(), attemptID.getTaskID().getId());
-
-            // Creating the file containing the data files generated by this task for this table
-            createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+            if (table != null) {

Review comment:
   Sure, will do!






Issue Time Tracking
---

Worklog Id: (was: 585781)
Time Spent: 2h 20m  (was: 2h 10m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585777&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585777
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 12:48
Start Date: 20/Apr/21 12:48
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616649742



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
           .executeWith(tableExecutor)
           .run(output -> {
             Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-            HiveIcebergRecordWriter writer = writers.get(output);
-            DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-            String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                attemptID.getJobID(), attemptID.getTaskID().getId());
-
-            // Creating the file containing the data files generated by this task for this table
-            createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+            if (table != null) {

Review comment:
   Got it, then we should push this change to the Iceberg code as well.






Issue Time Tracking
---

Worklog Id: (was: 585777)
Time Spent: 2h 10m  (was: 2h)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585745&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585745
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 11:57
Start Date: 20/Apr/21 11:57
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616613187



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##
@@ -256,4 +265,74 @@ private static PartitionSpec spec(Schema schema, Properties properties,
       return PartitionSpec.unpartitioned();
     }
   }
+
+  @Override
+  public void commitInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    String tableName = TableIdentifier.of(table.getDbName(), table.getTableName()).toString();
+
+    // check status to determine whether we need to commit or to abort
+    JobConf jobConf = new JobConf(conf);
+    String queryIdKey = jobConf.get("hive.query.id") + "." + tableName + ".result";
+    boolean success = jobConf.getBoolean(queryIdKey, false);
+
+    // construct the job context
+    JobID jobID = JobID.forName(jobConf.get(TezTask.HIVE_TEZ_COMMIT_JOB_ID + "." + tableName));
+    int numTasks = conf.getInt(TezTask.HIVE_TEZ_COMMIT_TASK_COUNT + "." + tableName, -1);
+    jobConf.setNumReduceTasks(numTasks);
+    JobContext jobContext = new JobContextImpl(jobConf, jobID, null);
+
+    // we should only commit this current table because
+    // for multi-table inserts, this hook method will be called sequentially for each target table
+    jobConf.set(InputFormatConfig.OUTPUT_TABLES, tableName);
+
+    OutputCommitter committer = new HiveIcebergOutputCommitter();
+    try {
+      if (success) {
+        try {
+          committer.commitJob(jobContext);
+        } catch (Exception commitExc) {
+          LOG.error("Error while trying to commit job (table: {}, jobID: {}). Will abort it now.",
+              tableName, jobID, commitExc);
+          abortJob(jobContext, committer, true);
+          throw new MetaException("Unable to commit job: " + commitExc.getMessage());
+        }
+      } else {
+        abortJob(jobContext, committer, false);
+      }
+    } finally {
+      // avoid config pollution with prefixed/suffixed keys
+      cleanCommitConfig(queryIdKey, tableName);
+    }
+  }
+
+  private void abortJob(JobContext jobContext, OutputCommitter committer, boolean suppressExc) throws MetaException {
+    try {
+      committer.abortJob(jobContext, JobStatus.State.FAILED);
+    } catch (IOException abortExc) {
+      LOG.error("Error while trying to abort failed job. There might be uncleaned data files.", abortExc);
+      if (!suppressExc) {
+        throw new MetaException("Unable to abort job: " + abortExc.getMessage());
+      }
+    }
+  }
+
+  private void cleanCommitConfig(String queryIdKey, String tableName) {
+    conf.unset(TezTask.HIVE_TEZ_COMMIT_JOB_ID + "." + tableName);
+    conf.unset(TezTask.HIVE_TEZ_COMMIT_TASK_COUNT + "." + tableName);
+    conf.unset(InputFormatConfig.SERIALIZED_TABLE_PREFIX + tableName);
+    conf.unset(queryIdKey);
+  }
+
+  @Override
+  public void preInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    // do nothing
+  }
+
+  @Override
+  public void rollbackInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    // do nothing

Review comment:
   I didn't put it there because if there is an execution error we should 
get a non-0 return code from Tez AM instead of an exception. But now that I'm 
thinking about it, I suppose Hive will throw an exception at the end if the 
code was non-0. I will test it out to make sure.
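
   For reference, the commit-or-abort flow of the hook in the diff above can be 
reduced to a small, self-contained sketch. This is only an illustration under 
assumptions: `CommitFlowSketch`, its nested `Committer` interface and the 
`finish` and `abort` methods are hypothetical names, not Hive or Hadoop classes, 
and a plain `Exception` stands in for `MetaException`.

```java
import java.io.IOException;

public class CommitFlowSketch {

  // Hypothetical stand-in for an output committer; NOT the Hadoop OutputCommitter API.
  interface Committer {
    void commitJob() throws IOException;

    void abortJob() throws IOException;
  }

  // Commits when the job succeeded, aborts otherwise.
  // If the commit itself fails, an abort is attempted with its own failure
  // suppressed, so the caller sees the original commit failure.
  static void finish(boolean success, Committer committer) throws Exception {
    if (success) {
      try {
        committer.commitJob();
      } catch (Exception commitExc) {
        abort(committer, true);
        throw new Exception("Unable to commit job: " + commitExc.getMessage());
      }
    } else {
      abort(committer, false);
    }
  }

  // Aborts the job; an abort failure is rethrown only when it is not suppressed.
  static void abort(Committer committer, boolean suppressExc) throws Exception {
    try {
      committer.abortJob();
    } catch (IOException abortExc) {
      if (!suppressExc) {
        throw new Exception("Unable to abort job: " + abortExc.getMessage());
      }
    }
  }

  public static void main(String[] args) {
    // A committer whose commit and abort both fail.
    Committer failing = new Committer() {
      @Override
      public void commitJob() throws IOException {
        throw new IOException("boom");
      }

      @Override
      public void abortJob() throws IOException {
        throw new IOException("abort failed");
      }
    };
    try {
      finish(true, failing);
    } catch (Exception e) {
      System.out.println(e.getMessage());
    }
  }
}
```

   In this sketch the abort failure is swallowed when it follows a failed commit, 
so the message that reaches the caller is the commit error.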






Issue Time Tracking
---

Worklog Id: (was: 585745)
Time Spent: 2h  (was: 1h 50m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to 

[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585741&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585741
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 11:52
Start Date: 20/Apr/21 11:52
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616608877



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
           .executeWith(tableExecutor)
           .run(output -> {
             Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-            HiveIcebergRecordWriter writer = writers.get(output);
-            DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-            String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                attemptID.getJobID(), attemptID.getTaskID().getId());
-
-            // Creating the file containing the data files generated by this task for this table
-            createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+            if (table != null) {

Review comment:
   This happens during task commit, that is, before the commitInsert hook is 
called. 
   
   The essential problem is that `OUTPUT_TABLES` contains all the target tables, 
but only the tables relevant to the given task are serialized into the job 
config. So the task tries to iterate over tables 1...N (based on 
`OUTPUT_TABLES`) but only has access to the serialized Table 1 (hence the `if`). 
I think the whole parallel commit logic for multi-table inserts is broken on 
both the task commit and job commit side whenever more than one vertex writes to 
the target tables. The tests currently pass because they create a single writer 
vertex, which has both tables serialized into its config.
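
   The mismatch described above can be illustrated with a small, self-contained 
sketch. It does not use the Hive or Iceberg APIs: the class 
`OutputTablesSketch`, the methods `lookupSerializedTable` and `tablesToCommit`, 
the `serialized.table.` key prefix and the map standing in for the job config 
are all hypothetical names chosen for illustration. It only shows why a null 
check is needed when the list of output tables is longer than the set of tables 
serialized into a given task's config.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OutputTablesSketch {

  // Hypothetical stand-in for reading a serialized table from the task's config.
  // Returns null when the table was not serialized for this task.
  static String lookupSerializedTable(Map<String, String> taskConfig, String tableName) {
    return taskConfig.get("serialized.table." + tableName);
  }

  // Iterates over every configured output table but keeps only those that this
  // task can actually look up, mirroring the null check added in the diff.
  static List<String> tablesToCommit(List<String> outputTables, Map<String, String> taskConfig) {
    List<String> result = new ArrayList<>();
    for (String name : outputTables) {
      String table = lookupSerializedTable(taskConfig, name);
      if (table != null) {
        result.add(name);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Two target tables are listed, but this task only has the first one serialized.
    List<String> outputTables = Arrays.asList("db.t1", "db.t2");
    Map<String, String> taskConfig = new HashMap<>();
    taskConfig.put("serialized.table.db.t1", "<serialized t1>");

    System.out.println(tablesToCommit(outputTables, taskConfig)); // prints [db.t1]
  }
}
```

   Without the null check, the lookup for `db.t2` would hand a null table to the 
commit logic, which is the failure the `if` guards against.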






Issue Time Tracking
---

Worklog Id: (was: 585741)
Time Spent: 1h 50m  (was: 1h 40m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585739=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585739
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 11:50
Start Date: 20/Apr/21 11:50
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616608877



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
             .executeWith(tableExecutor)
             .run(output -> {
               Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-              HiveIcebergRecordWriter writer = writers.get(output);
-              DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-              String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                  attemptID.getJobID(), attemptID.getTaskID().getId());
-
-              // Creating the file containing the data files generated by this task for this table
-              createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+              if (table != null) {

Review comment:
   This happens during task commit, so before the commitInsert hook is 
called. 
   
   The essential problem here is that `OUTPUT_TABLES` contains all the tables, 
however, only those tables are serialized into the jobconfig that are relevant 
for the given task. So it tries to iterate over 1...N tables, but only has 
access to serialized Table 1 (hence the if).






Issue Time Tracking
---

Worklog Id: (was: 585739)
Time Spent: 1h 40m  (was: 1.5h)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585732=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585732
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 11:42
Start Date: 20/Apr/21 11:42
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616603785



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergOutputCommitter.java
##
@@ -105,13 +105,18 @@ public void commitTask(TaskAttemptContext originalContext) throws IOException {
             .executeWith(tableExecutor)
             .run(output -> {
               Table table = HiveIcebergStorageHandler.table(context.getJobConf(), output);
-              HiveIcebergRecordWriter writer = writers.get(output);
-              DataFile[] closedFiles = writer != null ? writer.dataFiles() : new DataFile[0];
-              String fileForCommitLocation = generateFileForCommitLocation(table.location(), jobConf,
-                  attemptID.getJobID(), attemptID.getTaskID().getId());
-
-              // Creating the file containing the data files generated by this task for this table
-              createFileForCommit(closedFiles, fileForCommitLocation, table.io());
+              if (table != null) {

Review comment:
   Could we do this check in `HiveIcebergMetaHook.commitInsertTable`?
   Then we would not need any change here...






Issue Time Tracking
---

Worklog Id: (was: 585732)
Time Spent: 1.5h  (was: 1h 20m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585731=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585731
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 11:41
Start Date: 20/Apr/21 11:41
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616603042



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##
@@ -256,4 +265,74 @@ private static PartitionSpec spec(Schema schema, Properties properties,
       return PartitionSpec.unpartitioned();
     }
   }
+
+  @Override
+  public void commitInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    String tableName = TableIdentifier.of(table.getDbName(), table.getTableName()).toString();
+
+    // check status to determine whether we need to commit or to abort
+    JobConf jobConf = new JobConf(conf);
+    String queryIdKey = jobConf.get("hive.query.id") + "." + tableName + ".result";
+    boolean success = jobConf.getBoolean(queryIdKey, false);
+
+    // construct the job context
+    JobID jobID = JobID.forName(jobConf.get(TezTask.HIVE_TEZ_COMMIT_JOB_ID + "." + tableName));
+    int numTasks = conf.getInt(TezTask.HIVE_TEZ_COMMIT_TASK_COUNT + "." + tableName, -1);
+    jobConf.setNumReduceTasks(numTasks);
+    JobContext jobContext = new JobContextImpl(jobConf, jobID, null);
+
+    // we should only commit this current table because
+    // for multi-table inserts, this hook method will be called sequentially for each target table
+    jobConf.set(InputFormatConfig.OUTPUT_TABLES, tableName);
+
+    OutputCommitter committer = new HiveIcebergOutputCommitter();
+    try {
+      if (success) {
+        try {
+          committer.commitJob(jobContext);
+        } catch (Exception commitExc) {
+          LOG.error("Error while trying to commit job (table: {}, jobID: {}). Will abort it now.",
+              tableName, jobID, commitExc);
+          abortJob(jobContext, committer, true);
+          throw new MetaException("Unable to commit job: " + commitExc.getMessage());
+        }
+      } else {
+        abortJob(jobContext, committer, false);
+      }
+    } finally {
+      // avoid config pollution with prefixed/suffixed keys
+      cleanCommitConfig(queryIdKey, tableName);
+    }
+  }
+
+  private void abortJob(JobContext jobContext, OutputCommitter committer, boolean suppressExc) throws MetaException {
+    try {
+      committer.abortJob(jobContext, JobStatus.State.FAILED);
+    } catch (IOException abortExc) {
+      LOG.error("Error while trying to abort failed job. There might be uncleaned data files.", abortExc);
+      if (!suppressExc) {
+        throw new MetaException("Unable to abort job: " + abortExc.getMessage());
+      }
+    }
+  }
+
+  private void cleanCommitConfig(String queryIdKey, String tableName) {
+    conf.unset(TezTask.HIVE_TEZ_COMMIT_JOB_ID + "." + tableName);
+    conf.unset(TezTask.HIVE_TEZ_COMMIT_TASK_COUNT + "." + tableName);
+    conf.unset(InputFormatConfig.SERIALIZED_TABLE_PREFIX + tableName);
+    conf.unset(queryIdKey);
+  }
+
+  @Override
+  public void preInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    // do nothing
+  }
+
+  @Override
+  public void rollbackInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    // do nothing

Review comment:
   Shouldn't we call abortJob here?






Issue Time Tracking
---

Worklog Id: (was: 585731)
Time Spent: 1h 20m  (was: 1h 10m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585661=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585661
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 09:29
Start Date: 20/Apr/21 09:29
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616510710



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##
@@ -256,4 +265,74 @@ private static PartitionSpec spec(Schema schema, Properties properties,
       return PartitionSpec.unpartitioned();
     }
   }
+
+  @Override
+  public void commitInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    String tableName = TableIdentifier.of(table.getDbName(), table.getTableName()).toString();
+
+    // check status to determine whether we need to commit or to abort
+    JobConf jobConf = new JobConf(conf);
+    String queryIdKey = jobConf.get("hive.query.id") + "." + tableName + ".result";

Review comment:
   Makes sense. I was thinking of replacing the query id with a constant 
like `HIVE_TEZ_COMMIT_JOB_RESULT` so it would be `HIVE_TEZ_COMMIT_JOB_RESULT + 
"." + tableName`, like for job id and task num to make things consistent.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 585661)
Time Spent: 1h 10m  (was: 1h)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585660=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585660
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 09:29
Start Date: 20/Apr/21 09:29
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616510710



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##
@@ -256,4 +265,74 @@ private static PartitionSpec spec(Schema schema, Properties properties,
       return PartitionSpec.unpartitioned();
     }
   }
+
+  @Override
+  public void commitInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    String tableName = TableIdentifier.of(table.getDbName(), table.getTableName()).toString();
+
+    // check status to determine whether we need to commit or to abort
+    JobConf jobConf = new JobConf(conf);
+    String queryIdKey = jobConf.get("hive.query.id") + "." + tableName + ".result";

Review comment:
   Makes sense. I was thinking of replacing the query id with a constant 
like `HIVE_TEZ_COMMIT_JOB_RESULT` so it would be `HIVE_TEZ_COMMIT_JOB_ID + "." 
+ tableName`, like for job id and task num to make things consistent.
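
A small sketch of the constant-based key scheme being proposed: one constant per piece of commit information, each suffixed with the table name. The string values of the constants are assumptions made for illustration (the thread does not show them), and `keyFor` is a hypothetical helper.

```java
class CommitKeysSketch {
  // Assumed values: the thread names these constants but does not show their strings.
  static final String HIVE_TEZ_COMMIT_JOB_ID = "hive.tez.commit.job.id";
  static final String HIVE_TEZ_COMMIT_TASK_COUNT = "hive.tez.commit.task.count";
  static final String HIVE_TEZ_COMMIT_JOB_RESULT = "hive.tez.commit.job.result";

  // Hypothetical helper: builds a per-table config key as CONSTANT + "." + tableName.
  static String keyFor(String constant, String tableName) {
    return constant + "." + tableName;
  }

  public static void main(String[] args) {
    String tableName = "default.customers"; // example table identifier
    System.out.println(keyFor(HIVE_TEZ_COMMIT_JOB_ID, tableName));
    System.out.println(keyFor(HIVE_TEZ_COMMIT_TASK_COUNT, tableName));
    System.out.println(keyFor(HIVE_TEZ_COMMIT_JOB_RESULT, tableName));
  }
}
```

Building all three keys the same way would also let the cleanup step unset them with the same expression that set them.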




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 585660)
Time Spent: 1h  (was: 50m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585652=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585652
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 09:11
Start Date: 20/Apr/21 09:11
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616497039



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##
@@ -256,4 +265,74 @@ private static PartitionSpec spec(Schema schema, Properties properties,
       return PartitionSpec.unpartitioned();
     }
   }
+
+  @Override
+  public void commitInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    String tableName = TableIdentifier.of(table.getDbName(), table.getTableName()).toString();
+
+    // check status to determine whether we need to commit or to abort
+    JobConf jobConf = new JobConf(conf);
+    String queryIdKey = jobConf.get("hive.query.id") + "." + tableName + ".result";

Review comment:
   Maybe even "hive.query.id.%.result"? Or 
"hive.query.id.result."+tableName?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 585652)
Time Spent: 50m  (was: 40m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585651=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585651
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 20/Apr/21 09:10
Start Date: 20/Apr/21 09:10
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r616496048



##
File path: 
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java
##
@@ -256,4 +265,74 @@ private static PartitionSpec spec(Schema schema, Properties properties,
       return PartitionSpec.unpartitioned();
     }
   }
+
+  @Override
+  public void commitInsertTable(org.apache.hadoop.hive.metastore.api.Table table, boolean overwrite)
+      throws MetaException {
+    String tableName = TableIdentifier.of(table.getDbName(), table.getTableName()).toString();
+
+    // check status to determine whether we need to commit or to abort
+    JobConf jobConf = new JobConf(conf);
+    String queryIdKey = jobConf.get("hive.query.id") + "." + tableName + ".result";

Review comment:
   Could we use constants here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 585651)
Time Spent: 40m  (was: 0.5h)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585161=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585161
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 19/Apr/21 14:20
Start Date: 19/Apr/21 14:20
Worklog Time Spent: 10m 
  Work Description: marton-bod edited a comment on pull request #2161:
URL: https://github.com/apache/hive/pull/2161#issuecomment-822504138


   This is roughly what the changes would need to look like once we have the 
new Tez version released:
   https://github.com/marton-bod/hive/pull/1
   (using the new Tez API instead of the listing)




Issue Time Tracking
---

Worklog Id: (was: 585161)
Time Spent: 0.5h  (was: 20m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=585160=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-585160
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 19/Apr/21 14:19
Start Date: 19/Apr/21 14:19
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on pull request #2161:
URL: https://github.com/apache/hive/pull/2161#issuecomment-822504138


   This is roughly what the changes would need to look like once we have the 
new Tez version released:
   https://github.com/marton-bod/hive/pull/1
   




Issue Time Tracking
---

Worklog Id: (was: 585160)
Time Spent: 20m  (was: 10m)

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.





[jira] [Work logged] (HIVE-25006) Commit Iceberg writes in HiveMetaHook instead of TezAM

2021-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25006?focusedWorklogId=581779=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-581779
 ]

ASF GitHub Bot logged work on HIVE-25006:
-

Author: ASF GitHub Bot
Created on: 13/Apr/21 13:39
Start Date: 13/Apr/21 13:39
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2161:
URL: https://github.com/apache/hive/pull/2161#discussion_r612455851



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezTask.java
##
@@ -250,9 +255,32 @@ public int execute() {
           this.setException(new HiveException(monitor.getDiagnostics()));
         }
 
-        // fetch the counters
         try {
           Set<StatusGetOpts> statusGetOpts = EnumSet.of(StatusGetOpts.GET_COUNTERS);
+          // save useful commit information into session conf, e.g. for custom commit hooks
+          List<BaseWork> allWork = work.getAllWork();
+          boolean hasReducer = allWork.stream().map(workToVertex::get).anyMatch(v -> v.getName().startsWith("Reducer"));
+          for (BaseWork baseWork : allWork) {
+            Vertex vertex = workToVertex.get(baseWork);
+            if (!hasReducer || vertex.getName().startsWith("Reducer")) {
+              // construct the parsable job id
+              VertexStatus status = dagClient.getVertexStatus(vertex.getName(), statusGetOpts);
+              String[] jobIdParts = status.getId().split("_");
+              // status.getId() returns something like: vertex_1617722404520_0001_1_00
+              // this should be transformed to a parsable JobID: job_16177224045200_0001
+              int vertexId = Integer.parseInt(jobIdParts[jobIdParts.length - 1]);
+              String jobId = String.format("job_%s%d_%s", jobIdParts[1], vertexId, jobIdParts[2]);
+              // prefix with table name (for multi-table inserts), if available
+              String tableName = Optional.ofNullable(workToConf.get(baseWork)).map(c -> c.get("name")).orElse(null);
+              String jobIdKey = HIVE_TEZ_COMMIT_JOB_ID + (tableName == null ? "" : "." + tableName);;
+              String taskCountKey = HIVE_TEZ_COMMIT_TASK_COUNT + (tableName == null ? "" : "." + tableName);
+              // save info into session conf
+              HiveConf sessionConf = SessionState.get().getConf();
+              sessionConf.set(jobIdKey, jobId);
+              sessionConf.setInt(taskCountKey, status.getProgress().getSucceededTaskCount());

Review comment:
   I'll look into this in the following PR, once we've replaced the 
temporary listing solution with the permanent one and upgraded the Tez 
dependency.
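
For reference, the vertex-ID-to-job-ID conversion in the hunk above can be tried in isolation. This is a standalone sketch of the same string manipulation with no Tez or Hadoop dependency; the class and method names are hypothetical, and the sample ID is the one from the code comment.

```java
class VertexToJobIdSketch {
  // Mirrors the transformation in the diff: split the vertex status ID on "_",
  // parse the trailing vertex index as an int, append it to the cluster
  // timestamp, and keep the application sequence number as the job number.
  static String toJobId(String vertexStatusId) {
    String[] parts = vertexStatusId.split("_");
    int vertexId = Integer.parseInt(parts[parts.length - 1]);
    return String.format("job_%s%d_%s", parts[1], vertexId, parts[2]);
  }

  public static void main(String[] args) {
    // Sample ID taken from the comment in the diff.
    System.out.println(toJobId("vertex_1617722404520_0001_1_00"));
  }
}
```

Because the vertex index goes through `Integer.parseInt`, its leading zeros are dropped (`00` becomes `0`), so each writer vertex gets a distinct job ID derived from the same application ID.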






Issue Time Tracking
---

Worklog Id: (was: 581779)
Remaining Estimate: 0h
Time Spent: 10m

> Commit Iceberg writes in HiveMetaHook instead of TezAM
> --
>
> Key: HIVE-25006
> URL: https://issues.apache.org/jira/browse/HIVE-25006
> Project: Hive
>  Issue Type: Task
>Reporter: Marton Bod
>Assignee: Marton Bod
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Trigger the write commits in the HiveIcebergStorageHandler#commitInsertTable. 
> This will enable us to implement insert overwrites for iceberg tables.


