[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata
[ https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=460984&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-460984 ]

ASF GitHub Bot logged work on HIVE-23840:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 20/Jul/20 09:32
            Start Date: 20/Jul/20 09:32
    Worklog Time Spent: 10m
      Work Description: pvary merged pull request #1251:
URL: https://github.com/apache/hive/pull/1251

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 460984)
    Time Spent: 1h 10m  (was: 1h)

> Use LLAP to get orc metadata
>
>
>                 Key: HIVE-23840
>                 URL: https://issues.apache.org/jira/browse/HIVE-23840
>             Project: Hive
>          Issue Type: Improvement
>          Components: Transactions
>            Reporter: Peter Vary
>            Assignee: Peter Vary
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> HIVE-23824 added the possibility to access ORC metadata. We can use this to
> decide which delta files should be read, and which could be omitted.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata
[ https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458894&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458894 ]

ASF GitHub Bot logged work on HIVE-23840:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 19:48
            Start Date: 14/Jul/20 19:48
    Worklog Time Spent: 10m
      Work Description: pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454602904

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -129,6 +137,16 @@
    */
   private SearchArgument deleteEventSarg = null;
+  /**
+   * Cachetag associated with the Split
+   */
+  private final CacheTag cacheTag;
+
+  /**
+   * Skip using Llap IO cache for checking delete_delta files if the configuration is not correct
+   */
+  private static boolean skipLlapCache = true;

Review comment:
That was a mistake. Corrected, and initialized as false

Issue Time Tracking
-------------------

    Worklog Id:     (was: 458894)
    Time Spent: 50m  (was: 40m)
[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata
[ https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458893&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458893 ]

ASF GitHub Bot logged work on HIVE-23840:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 19:48
            Start Date: 14/Jul/20 19:48
    Worklog Time Spent: 10m
      Work Description: pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454602727

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -232,6 +250,17 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
     this.syntheticProps = orcSplit.getSyntheticAcidProps();
+    if (LlapHiveUtils.isLlapMode(conf) && LlapProxy.isDaemon()
+            && HiveConf.getBoolVar(conf, ConfVars.LLAP_TRACK_CACHE_USAGE))
+    {
+      MapWork mapWork = LlapHiveUtils.findMapWork(conf);

Review comment:
Good idea, done!

Issue Time Tracking
-------------------

    Worklog Id:     (was: 458893)
    Time Spent: 40m  (was: 0.5h)
[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata
[ https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458895&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458895 ]

ASF GitHub Bot logged work on HIVE-23840:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 19:48
            Start Date: 14/Jul/20 19:48
    Worklog Time Spent: 10m
      Work Description: pvary commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454603042

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -1562,20 +1580,31 @@ public int compareTo(CompressedOwid other) {
       try {
         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
         if (deleteDeltaDirs.length > 0) {
+          FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+          AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+              AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
           for (Path deleteDeltaDir : deleteDeltaDirs) {
-            FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+              continue;
+            }
             Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket, new OrcRawRecordMerger.Options().isCompacting(false), null);
             for (Path deleteDeltaFile : deleteDeltaFiles) {
               try {
-                /**
-                 * todo: we have OrcSplit.orcTail so we should be able to get stats from there
-                 */
-                Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
-                if (deleteDeltaReader.getNumberOfRows() <= 0) {
+                ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag);
+                OrcTail orcTail = readerData.orcTail;
+                if (orcTail.getFooter().getNumberOfRows() <= 0) {
                   continue; // just a safe check to ensure that we are not reading empty delete files.
                 }
+                OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
+                if (!deleteKeyInterval.isIntersects(keyInterval)) {
+                  // If there is no intersection between data and delete delta, do not read delete file
+                  continue;
+                }
+                // Create the reader if we got the OrcTail from cache

Review comment:
Added more comment

Issue Time Tracking
-------------------

    Worklog Id:     (was: 458895)
    Time Spent: 1h  (was: 50m)
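The key-interval check in the hunk above can be illustrated with a small standalone sketch. This is not Hive's OrcRawRecordMerger.KeyInterval: RowKey and KeyRange are hypothetical names, and the sketch assumes that keys order by (writeId, bucket, rowId) and that a null bound means "unbounded on that side". Under those assumptions, a delete delta file whose min/max key range does not overlap the split's key range cannot delete any row of the split, so it does not have to be read.

```java
import java.util.Comparator;

/** Hypothetical, simplified row key; a stand-in, not Hive's RecordIdentifier. */
final class RowKey implements Comparable<RowKey> {
  final long writeId;
  final int bucket;
  final long rowId;

  // Assumed ordering: writeId first, then bucket, then rowId.
  private static final Comparator<RowKey> ORDER =
      Comparator.comparingLong((RowKey k) -> k.writeId)
          .thenComparingInt(k -> k.bucket)
          .thenComparingLong(k -> k.rowId);

  RowKey(long writeId, int bucket, long rowId) {
    this.writeId = writeId;
    this.bucket = bucket;
    this.rowId = rowId;
  }

  @Override
  public int compareTo(RowKey other) {
    return ORDER.compare(this, other);
  }
}

/** Closed key range [min, max]; a null bound is treated as unbounded (assumption). */
final class KeyRange {
  final RowKey min;
  final RowKey max;

  KeyRange(RowKey min, RowKey max) {
    this.min = min;
    this.max = max;
  }

  /** Two ranges intersect unless one starts strictly after the other ends. */
  boolean intersects(KeyRange other) {
    boolean startsAfterOtherEnds =
        min != null && other.max != null && min.compareTo(other.max) > 0;
    boolean endsBeforeOtherStarts =
        max != null && other.min != null && max.compareTo(other.min) < 0;
    return !(startsAfterOtherEnds || endsBeforeOtherStarts);
  }
}
```

A caller would compute one KeyRange for the split and one per delete delta file (for example from column statistics in the file tail) and skip every file for which intersects returns false.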
[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata
[ https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458681&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458681 ]

ASF GitHub Bot logged work on HIVE-23840:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 14:38
            Start Date: 14/Jul/20 14:38
    Worklog Time Spent: 10m
      Work Description: szlta commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454393621

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -129,6 +137,16 @@
    */
   private SearchArgument deleteEventSarg = null;
+  /**
+   * Cachetag associated with the Split
+   */
+  private final CacheTag cacheTag;
+
+  /**
+   * Skip using Llap IO cache for checking delete_delta files if the configuration is not correct
+   */
+  private static boolean skipLlapCache = true;

Review comment:
Initialized to true on purpose for now? If not, I don't see it getting set to false.

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -1562,20 +1580,31 @@ public int compareTo(CompressedOwid other) {
       try {
         final Path[] deleteDeltaDirs = getDeleteDeltaDirsFromSplit(orcSplit);
         if (deleteDeltaDirs.length > 0) {
+          FileSystem fs = orcSplit.getPath().getFileSystem(conf);
+          AcidOutputFormat.Options orcSplitMinMaxWriteIds =
+              AcidUtils.parseBaseOrDeltaBucketFilename(orcSplit.getPath(), conf);
           int totalDeleteEventCount = 0;
           for (Path deleteDeltaDir : deleteDeltaDirs) {
-            FileSystem fs = deleteDeltaDir.getFileSystem(conf);
+            if (!isQualifiedDeleteDeltaForSplit(orcSplitMinMaxWriteIds, deleteDeltaDir)) {
+              continue;
+            }
             Path[] deleteDeltaFiles = OrcRawRecordMerger.getDeltaFiles(deleteDeltaDir, bucket, new OrcRawRecordMerger.Options().isCompacting(false), null);
             for (Path deleteDeltaFile : deleteDeltaFiles) {
               try {
-                /**
-                 * todo: we have OrcSplit.orcTail so we should be able to get stats from there
-                 */
-                Reader deleteDeltaReader = OrcFile.createReader(deleteDeltaFile, OrcFile.readerOptions(conf));
-                if (deleteDeltaReader.getNumberOfRows() <= 0) {
+                ReaderData readerData = getOrcTail(deleteDeltaFile, conf, cacheTag);
+                OrcTail orcTail = readerData.orcTail;
+                if (orcTail.getFooter().getNumberOfRows() <= 0) {
                   continue; // just a safe check to ensure that we are not reading empty delete files.
                 }
+                OrcRawRecordMerger.KeyInterval deleteKeyInterval = findDeleteMinMaxKeys(orcTail, deleteDeltaFile);
+                if (!deleteKeyInterval.isIntersects(keyInterval)) {
+                  // If there is no intersection between data and delete delta, do not read delete file
+                  continue;
+                }
+                // Create the reader if we got the OrcTail from cache

Review comment:
nit: comment could be more verbose, like: Reader can be reused if it was created before: only for non-LLAP cache cases, otherwise we need to create it here

Issue Time Tracking
-------------------

    Worklog Id:     (was: 458681)
    Time Spent: 0.5h  (was: 20m)
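The body of isQualifiedDeleteDeltaForSplit is not part of the quoted hunk. One plausible form of such a check is sketched here under two assumptions: delete delta directories are named delete_delta_<minWriteId>_<maxWriteId> with an optional suffix, and a delete event can only target rows written by the same or an earlier write ID. Given both, a directory whose maximum write ID is lower than the split's minimum write ID can be skipped. DeleteDeltaFilterSketch, maxWriteId and mayAffectSplit are hypothetical names, not Hive APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a write-ID based pre-filter; not the code from the pull request. */
final class DeleteDeltaFilterSketch {
  // Assumed directory naming: delete_delta_<min>_<max>, optionally followed by "_<suffix>".
  private static final Pattern DELETE_DELTA =
      Pattern.compile("delete_delta_(\\d+)_(\\d+)(?:_.*)?");

  private DeleteDeltaFilterSketch() {
  }

  /** Extracts the maximum write ID from a delete delta directory name. */
  static long maxWriteId(String dirName) {
    Matcher m = DELETE_DELTA.matcher(dirName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a delete delta directory: " + dirName);
    }
    return Long.parseLong(m.group(2));
  }

  /**
   * Assumption: deletes only target rows written by the same or an earlier write ID,
   * so a delete delta that ends before the split's first write ID cannot affect it.
   */
  static boolean mayAffectSplit(long splitMinWriteId, String deleteDeltaDirName) {
    return maxWriteId(deleteDeltaDirName) >= splitMinWriteId;
  }
}
```

Filtering at the directory level like this avoids listing and opening files at all, which is cheaper than the per-file key-interval check that follows it in the loop.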
[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata
[ https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458652&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458652 ]

ASF GitHub Bot logged work on HIVE-23840:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 14:18
            Start Date: 14/Jul/20 14:18
    Worklog Time Spent: 10m
      Work Description: szlta commented on a change in pull request #1251:
URL: https://github.com/apache/hive/pull/1251#discussion_r454390429

## File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java ##
@@ -232,6 +250,17 @@ private VectorizedOrcAcidRowBatchReader(JobConf conf, OrcSplit orcSplit, Reporte
     this.syntheticProps = orcSplit.getSyntheticAcidProps();
+    if (LlapHiveUtils.isLlapMode(conf) && LlapProxy.isDaemon()
+            && HiveConf.getBoolVar(conf, ConfVars.LLAP_TRACK_CACHE_USAGE))
+    {
+      MapWork mapWork = LlapHiveUtils.findMapWork(conf);

Review comment:
We could spare the deserialization of MapWork from JobConf here, if we pass the MapWork instance already present in LlapRecordReader to VectorizedOrcAcidRowBatchReader ctor. (Downside is that in turn we would need to adjust the other ctor's of VectorizedOrcAcidRowBatchReader too)

Issue Time Tracking
-------------------

    Worklog Id:     (was: 458652)
    Time Spent: 20m  (was: 10m)
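The trade-off in the review comment above (pass an already available object into the constructor instead of deserializing it again) can be shown with a generic sketch. Everything here is hypothetical: PlanAwareReaderSketch stands in for the reader, a plain String stands in for MapWork, and deserializePlan stands in for the lookup from the job configuration. It only illustrates the constructor-chaining pattern, not the change that was actually made in the pull request.

```java
/** Hypothetical illustration of passing a pre-built object to avoid repeated deserialization. */
final class PlanAwareReaderSketch {
  /** Counts how often the expensive fallback ran; for demonstration only. */
  static int deserializations = 0;

  final String plan;

  /** Existing-style constructor: the caller has no plan, so it is looked up from the config. */
  PlanAwareReaderSketch(String serializedConf) {
    this(serializedConf, null);
  }

  /** New-style constructor: a caller that already holds the plan passes it in. */
  PlanAwareReaderSketch(String serializedConf, String planFromCaller) {
    this.plan = planFromCaller != null ? planFromCaller : deserializePlan(serializedConf);
  }

  /** Stand-in for an expensive deserialization step. */
  static String deserializePlan(String serializedConf) {
    deserializations++;
    return "plan:" + serializedConf;
  }
}
```

Keeping the old constructor as a thin delegate is one way to limit the downside mentioned in the comment, since existing callers do not need to change.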
[jira] [Work logged] (HIVE-23840) Use LLAP to get orc metadata
[ https://issues.apache.org/jira/browse/HIVE-23840?focusedWorklogId=458543&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-458543 ]

ASF GitHub Bot logged work on HIVE-23840:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jul/20 09:48
            Start Date: 14/Jul/20 09:48
    Worklog Time Spent: 10m
      Work Description: pvary opened a new pull request #1251:
URL: https://github.com/apache/hive/pull/1251

Started to use new LLAP getOrcTailFromCache
Refactored stuff to use the tail instead of the reader related things
Added some unit tests for the new smaller components

Issue Time Tracking
-------------------

            Worklog Id:     (was: 458543)
    Remaining Estimate: 0h
            Time Spent: 10m