[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: HIVE-20738

> VectorizedOrcAcidRowBatchReader doesn't filter delete events
> ------------------------------------------------------------
>
>                 Key: HIVE-16812
>                 URL: https://issues.apache.org/jira/browse/HIVE-16812
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Transactions
>    Affects Versions: 2.3.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>             Fix For: 4.0.0
>
>         Attachments: HIVE-16812.02.patch, HIVE-16812.04.patch, HIVE-16812.05.patch, HIVE-16812.06.patch, HIVE-16812.07.patch
>
>
> The constructor of VectorizedOrcAcidRowBatchReader has
> {noformat}
> // Clone readerOptions for deleteEvents.
> Reader.Options deleteEventReaderOptions = readerOptions.clone();
> // Set the range on the deleteEventReaderOptions to 0 to Long.MAX_VALUE because
> // we always want to read all the delete delta files.
> deleteEventReaderOptions.range(0, Long.MAX_VALUE);
> {noformat}
> This is suboptimal since base and deltas are sorted by ROW__ID, so for each split of the base we can find the min/max ROW__ID and only load delete events from the deltas that fall in the [min, max] range. This reduces the number of delete events held in memory (to no more than there are rows in the split).
> When we support sorting on PK, the same should apply, but we'd need to make sure to store PKs in the ORC index.
> See {{OrcRawRecordMerger.discoverKeyBounds()}}.
> {{hive.acid.key.index}} in the ORC footer has an index of ROW__IDs, so we should know min/max easily for any file written by {{OrcRecordUpdater}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
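The filtering the ticket proposes can be sketched outside of Hive's actual classes. The {{RowId}} class and {{filterDeleteEvents}} method below are hypothetical stand-ins for Hive's {{RecordIdentifier}} and the reader logic, assuming delete events compare by (writeId, bucket, rowId) as ROW__ID does; the bounds stand in for what {{hive.acid.key.index}} would yield for one split:

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch, not the actual Hive API. Delete events are sorted by
// ROW__ID, so given the min/max ROW__ID covered by a base split we can
// skip delete events outside [min, max] instead of loading them all.
public class DeleteEventFilterSketch {

  // Simplified, hypothetical stand-in for Hive's RecordIdentifier (ROW__ID).
  static final class RowId implements Comparable<RowId> {
    final long writeId, bucket, rowId;
    RowId(long writeId, long bucket, long rowId) {
      this.writeId = writeId; this.bucket = bucket; this.rowId = rowId;
    }
    @Override public int compareTo(RowId o) {
      if (writeId != o.writeId) return Long.compare(writeId, o.writeId);
      if (bucket != o.bucket) return Long.compare(bucket, o.bucket);
      return Long.compare(rowId, o.rowId);
    }
  }

  // Keep only delete events whose ROW__ID falls in [min, max] for this split.
  static List<RowId> filterDeleteEvents(List<RowId> deleteEvents, RowId min, RowId max) {
    List<RowId> kept = new ArrayList<>();
    for (RowId e : deleteEvents) {
      if (e.compareTo(min) >= 0 && e.compareTo(max) <= 0) {
        kept.add(e);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Hypothetical bounds for one split of the base.
    RowId min = new RowId(5, 0, 100);
    RowId max = new RowId(5, 0, 199);
    List<RowId> deletes = new ArrayList<>();
    deletes.add(new RowId(5, 0, 50));   // before the split's range: skipped
    deletes.add(new RowId(5, 0, 120));  // inside the range: kept
    deletes.add(new RowId(5, 0, 199));  // inclusive upper bound: kept
    deletes.add(new RowId(6, 0, 10));   // later write id: skipped
    List<RowId> kept = filterDeleteEvents(deletes, min, max);
    System.out.println("kept=" + kept.size() + " of " + deletes.size());
  }
}
```

With sorted delete deltas, the same bound check could terminate the scan early instead of filtering event by event; the linear filter above is just the simplest illustration of the [min, max] pruning.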
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Fix Version/s: 4.0.0
           Status: Open  (was: Patch Available)
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Attachment: HIVE-16812.07.patch
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Attachment: HIVE-16812.06.patch
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Attachment: HIVE-16812.05.patch
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Attachment: HIVE-16812.04.patch
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Attachment: HIVE-16812.02.patch
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Status: Patch Available  (was: Open)
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Description: (edited: added the note that {{hive.acid.key.index}} in the ORC footer has an index of ROW__IDs, so min/max are easily known for any file written by {{OrcRecordUpdater}}; the rest of the description is unchanged)
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Priority: Critical  (was: Major)
[jira] [Updated] (HIVE-16812) VectorizedOrcAcidRowBatchReader doesn't filter delete events
[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-16812:
----------------------------------
    Description: (edited: added the reference to {{OrcRawRecordMerger.discoverKeyBounds()}}; the rest of the description is unchanged)