[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata

2020-07-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=460984&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460984
 ]

ASF GitHub Bot logged work on HIVE-23840:
-

Author: ASF GitHub Bot
Created on: 20/Jul/20 09:32
Start Date: 20/Jul/20 09:32
Worklog Time Spent: 10m 
  Work Description: pvary merged pull request #1251:
URL: https://github.com/apache/hive/pull/1251


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 460984)
Time Spent: 1h 10m  (was: 1h)

> Use LLAP to get orc metadata
> 
>
> Key: HIVE-23840
> URL: https://issues.apache.org/jira/browse/HIVE-23840
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Peter Vary
>Assignee: Peter Vary
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> HIVE-23824 added the possibility to access ORC metadata. We can use this to 
> decide which delta files should be read, and which could be omitted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458894&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458894
 ]

ASF GitHub Bot logged work on HIVE-23840:
-

Author: ASF GitHub Bot
Created on: 14/Jul/20 19:48
Start Date: 14/Jul/20 19:48
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454602904



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -129,6 +137,16 @@
*/
   private SearchArgument deleteEventSarg = null;
 
+  /**
+   * Cachetag associated with the Split
+   */
+  private final CacheTag cacheTag;
+
+  /**
+   * Skip using Llap IO cache for checking delete_delta files if the configuration is not correct
+   */
+  private static boolean skipLlapCache = true;

Review comment:
   That was a mistake. Corrected; it is now initialized to false.
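
To illustrate why the initial value matters, here is a minimal sketch (not the code from the pull request) of a static guard flag that starts as false and is flipped to true only after a lookup fails because of a bad configuration. The names `CacheGuardSketch`, `lookup`, and `isSkipping`, and the use of `IllegalStateException` as the "configuration is not correct" signal, are assumptions made for this example. If the flag started as true, the cache path would never be attempted.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch; the names are illustrative and not taken from Hive. */
class CacheGuardSketch {
  // Starts as false so the cache is tried; set to true after a configuration failure.
  private static boolean skipCache = false;

  /** Returns the cached value, or empty if the cache is skipped or misconfigured. */
  static <T> Optional<T> lookup(Supplier<T> cacheCall) {
    if (skipCache) {
      return Optional.empty();
    }
    try {
      return Optional.ofNullable(cacheCall.get());
    } catch (IllegalStateException e) {
      // Assumed signal for "configuration is not correct": stop trying from now on.
      skipCache = true;
      return Optional.empty();
    }
  }

  static boolean isSkipping() {
    return skipCache;
  }
}
```

Because the flag is static, one failed lookup disables the cache path for every later caller in the same JVM, which is why a wrong default silently turns the feature off.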







Issue Time Tracking
---

Worklog Id: (was: 458894)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458893&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458893
 ]

ASF GitHub Bot logged work on HIVE-23840:
-

Author: ASF GitHub Bot
Created on: 14/Jul/20 19:48
Start Date: 14/Jul/20 19:48
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454602727



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -232,6 +250,17 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
 
     this.syntheticProps = orcSplit.getSyntheticAcidProps();
 
+    if (LlapHiveUtils.isLlapMode(conf) && LlapProxy.isDaemon()
+        && HiveConf.getBoolVar(conf, ConfVars.LLAP_TRACK_CACHE_USAGE))
+    {
+      MapWork mapWork = LlapHiveUtils.findMapWork(conf);

Review comment:
   Good idea, done!







Issue Time Tracking
---

Worklog Id: (was: 458893)
Time Spent: 40m  (was: 0.5h)



[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458895&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458895
 ]

ASF GitHub Bot logged work on HIVE-23840:
-

Author: ASF GitHub Bot
Created on: 14/Jul/20 19:48
Start Date: 14/Jul/20 19:48
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454603042



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1562,20 +1580,31 @@ public int compareTo(CompressedOwid other) {
       try {
         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
         if (deleteDeltaDirs.length > 0) {
+          FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+          AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+              AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
           for (Path deleteDeltaDir : deleteDeltaDirs) {
-            FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+              continue;
+            }
             Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
                 new OrcRawRecordMerger.Options().isCompacting(false), null);
             for (Path deleteDeltaFile : deleteDeltaFiles) {
               try {
-                /**
-                 * todo: we have OrcSplit.orcTail so we should be able to get stats from there
-                 */
-                Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
-                if (deleteDeltaReader.getNumberOfRows() <= 0) {
+                ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag);
+                OrcTail orcTail = readerData.orcTail;
+                if (orcTail.getFooter().getNumberOfRows() <= 0) {
                   continue; // just a safe check to ensure that we are not reading empty delete files.
                 }
+                OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
+                if (!deleteKeyInterval.isIntersects(keyInterval)) {
+                  // If there is no intersection between data and delete delta, do not read delete file
+                  continue;
+                }
+                // Create the reader if we got the OrcTail from cache

Review comment:
   Added a more detailed comment.
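
The `isIntersects(keyInterval)` check in the hunk above decides whether a delete delta file can be skipped without being read. A minimal sketch of such an interval test follows; it is not the Hive `KeyInterval` class. `IntervalSketch` is a hypothetical name, the keys are plain `Comparable` values, and treating a null bound as "unbounded" is an assumption of this example.

```java
/** Hypothetical closed interval [min, max]; a null bound is assumed to mean "unbounded". */
final class IntervalSketch<K extends Comparable<K>> {
  private final K min;
  private final K max;

  IntervalSketch(K min, K max) {
    this.min = min;
    this.max = max;
  }

  /** True when the two closed intervals share at least one key. */
  boolean intersects(IntervalSketch<K> other) {
    boolean startsBeforeOtherEnds =
        min == null || other.max == null || min.compareTo(other.max) <= 0;
    boolean endsAfterOtherStarts =
        max == null || other.min == null || max.compareTo(other.min) >= 0;
    return startsBeforeOtherEnds && endsAfterOtherStarts;
  }
}
```

Under these assumptions, a delete file whose keys span 201 to 300 cannot affect a split whose keys span 100 to 200, so only its metadata needs to be consulted.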







Issue Time Tracking
---

Worklog Id: (was: 458895)
Time Spent: 1h  (was: 50m)



[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458681&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458681
 ]

ASF GitHub Bot logged work on HIVE-23840:
-

Author: ASF GitHub Bot
Created on: 14/Jul/20 14:38
Start Date: 14/Jul/20 14:38
Worklog Time Spent: 10m 
  Work Description: szlta commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454393621



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -129,6 +137,16 @@
*/
   private SearchArgument deleteEventSarg = null;
 
+  /**
+   * Cachetag associated with the Split
+   */
+  private final CacheTag cacheTag;
+
+  /**
+   * Skip using Llap IO cache for checking delete_delta files if the configuration is not correct
+   */
+  private static boolean skipLlapCache = true;

Review comment:
   Initialized to true on purpose for now? If not, I don't see it getting 
set to false.

##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -1562,20 +1580,31 @@ public int compareTo(CompressedOwid other) {
       try {
         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
         if (deleteDeltaDirs.length > 0) {
+          FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+          AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+              AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
           for (Path deleteDeltaDir : deleteDeltaDirs) {
-            FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+              continue;
+            }
             Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket,
                 new OrcRawRecordMerger.Options().isCompacting(false), null);
             for (Path deleteDeltaFile : deleteDeltaFiles) {
               try {
-                /**
-                 * todo: we have OrcSplit.orcTail so we should be able to get stats from there
-                 */
-                Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
-                if (deleteDeltaReader.getNumberOfRows() <= 0) {
+                ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag);
+                OrcTail orcTail = readerData.orcTail;
+                if (orcTail.getFooter().getNumberOfRows() <= 0) {
                   continue; // just a safe check to ensure that we are not reading empty delete files.
                 }
+                OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
+                if (!deleteKeyInterval.isIntersects(keyInterval)) {
+                  // If there is no intersection between data and delete delta, do not read delete file
+                  continue;
+                }
+                // Create the reader if we got the OrcTail from cache

Review comment:
   nit: the comment could be more verbose, e.g.: the Reader can be reused if
it was created before (only in the non-LLAP cache cases); otherwise we need to
create it here.







Issue Time Tracking
---

Worklog Id: (was: 458681)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458652&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458652
 ]

ASF GitHub Bot logged work on HIVE-23840:
-

Author: ASF GitHub Bot
Created on: 14/Jul/20 14:18
Start Date: 14/Jul/20 14:18
Worklog Time Spent: 10m 
  Work Description: szlta commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454390429



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##
@@ -232,6 +250,17 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
 
     this.syntheticProps = orcSplit.getSyntheticAcidProps();
 
+    if (LlapHiveUtils.isLlapMode(conf) && LlapProxy.isDaemon()
+        && HiveConf.getBoolVar(conf, ConfVars.LLAP_TRACK_CACHE_USAGE))
+    {
+      MapWork mapWork = LlapHiveUtils.findMapWork(conf);

Review comment:
   We could spare the deserialization of MapWork from JobConf here if we
pass the MapWork instance already present in LlapRecordReader to the
VectorizedOrcAcidRowBatchReader ctor. (The downside is that we would in turn
need to adjust the other ctors of VectorizedOrcAcidRowBatchReader too.)
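
A minimal sketch of this suggestion, assuming nothing about the real constructors: one overload accepts an object the caller already holds, and the existing-style overload delegates to it and falls back to the expensive lookup. `BatchReaderSketch`, `deserializePlan`, the `Map` standing in for the job configuration, and the `String` standing in for the plan object are all hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch of the suggestion; none of these types are Hive classes. */
final class BatchReaderSketch {
  static int deserializations = 0;  // counts the expensive path, for illustration only

  private final String plan;

  /** Preferred: the caller already holds the plan object, so nothing is deserialized. */
  BatchReaderSketch(Map<String, String> conf, String plan) {
    this.plan = plan != null ? plan : deserializePlan(conf);
  }

  /** Existing-style constructor: falls back to reading the plan from the configuration. */
  BatchReaderSketch(Map<String, String> conf) {
    this(conf, null);
  }

  // Stand-in for the costly deserialization step.
  private static String deserializePlan(Map<String, String> conf) {
    deserializations++;
    return conf.get("plan");
  }

  String plan() {
    return plan;
  }
}
```

Delegating from the old overload keeps existing call sites compiling, which limits the amount of constructor adjustment the reviewer mentions as the downside.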







Issue Time Tracking
---

Worklog Id: (was: 458652)
Time Spent: 20m  (was: 10m)



[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata

2020-07-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458543&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458543
 ]

ASF GitHub Bot logged work on HIVE-23840:
-

Author: ASF GitHub Bot
Created on: 14/Jul/20 09:48
Start Date: 14/Jul/20 09:48
Worklog Time Spent: 10m 
  Work Description: pvary opened a new pull request #1251:
URL: https://github.com/apache/hive/pull/1251


   Started to use the new LLAP getOrcTailFromCache
   Refactored the code to use the ORC tail instead of the reader
   Added unit tests for the new, smaller components





Issue Time Tracking
---

Worklog Id: (was: 458543)
Remaining Estimate: 0h
Time Spent: 10m
