[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519448 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 07:41
Start Date: 03/Dec/20 07:41
Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r534827989

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    // save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+        Ref.from(false), false, dirSnapshots);
+    return dir.getObsolete().size();

Review comment: New deltas by themselves don't stop the compaction from being marked cleaned (they are not in the obsolete list), but overlapping compactions do. I think a problematic scenario could be some long-running ETL jobs running next to your streaming writes. That would mean that cleaning jobs would pile up: they would run continuously, but the low values in min_history_level would prevent them from ever cleaning up every obsolete directory and from ever being marked as cleaned :(

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519448)
Time Spent: 5h 10m (was: 5h)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete
> files in the FS
> ---------------------------------------------------------------------------
>
> Key: HIVE-24444
> URL: https://issues.apache.org/jira/browse/HIVE-24444
> Project: Hive
> Issue Type: Bug
> Reporter: Karen Coppage
> Assignee: Karen Coppage
> Priority: Major
> Labels: pull-request-available
> Time Spent: 5h 10m
> Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only
> if +any+ files are deleted by the cleaner.
> This could cause a problem in the following case:
> Say for table_1, compaction1's cleaning was blocked by an open txn, and
> compaction is run again on the same table (compaction2). Both compaction1 and
> compaction2 could be in "ready for cleaning" at the same time. By this time
> the blocking open txn could be committed. When the cleaner runs, one of
> compaction1 and compaction2 will remain in the "ready for cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all
> obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files and compaction1 will stay in the queue in a "ready
> for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example if
> HMS schema changes are out of the question) then HIVE-24314 + this change will
> fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
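The cleaning condition discussed in this thread (only mark a compaction cleaned once no eventually-obsolete directories remain) can be illustrated with a minimal, self-contained sketch. This is not the real Hive code: `CleanerSketch`, its write-id model, and both method names are hypothetical stand-ins for `removeFiles` / `getNumEventuallyObsoleteDirs`.

```java
import java.util.List;

// Toy model of the Cleaner check: a delta directory is "eventually
// obsolete" if a base directory with an equal-or-higher write id
// already covers it (i.e. it would be obsolete once all txns commit).
public class CleanerSketch {

    // How many delta dirs the cleaner would still have to remove,
    // assuming every open transaction is eventually committed.
    static long numEventuallyObsoleteDirs(long baseWriteId, List<Long> deltaWriteIds) {
        return deltaWriteIds.stream().filter(w -> w <= baseWriteId).count();
    }

    // Mirror of the patched removeFiles(): only report "cleaned"
    // when nothing eventually-obsolete remains.
    static boolean markCleaned(long baseWriteId, List<Long> deltaWriteIds) {
        return numEventuallyObsoleteDirs(baseWriteId, deltaWriteIds) == 0;
    }

    public static void main(String[] args) {
        // delta_5 is still covered by base_10, so the compaction must
        // not be marked cleaned yet; delta_12 alone is fine.
        System.out.println(markCleaned(10, List.of(5L, 12L))); // false
        System.out.println(markCleaned(10, List.of(12L)));     // true
    }
}
```

This also shows why two "ready for cleaning" entries behave as described above: whichever entry runs second sees zero remaining obsolete dirs and can safely be marked cleaned.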
[jira] [Updated] (HIVE-24473) Update HBase version to 2.1.10
[ https://issues.apache.org/jira/browse/HIVE-24473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Istvan Toth updated HIVE-24473:
-------------------------------
Description:
Hive currently builds with a 2.0.0 pre-release. Update HBase to a more recent version.
We cannot use anything later than 2.2.4 because of HBASE-22394.
So the options are 2.1.10 and 2.2.4.
I suggest 2.1.10 because it's a chronologically later release, and it maximises compatibility with HBase server deployments.

was:
Hive currently builds with a 2.0.0 pre-release. Update HBase to more recent version.
We cannot use anything later than 2.2.4 because of HBASE-22394
So the options are 2.1.10 and 2.2.4
I suggest 2.1.10 because it's a chronologically later release, and it maximises compatibility HBase server deployments.

> Update HBase version to 2.1.10
> ------------------------------
>
> Key: HIVE-24473
> URL: https://issues.apache.org/jira/browse/HIVE-24473
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Affects Versions: 4.0.0
> Reporter: Istvan Toth
> Assignee: Istvan Toth
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-24473.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hive currently builds with a 2.0.0 pre-release.
> Update HBase to a more recent version.
> We cannot use anything later than 2.2.4 because of HBASE-22394.
> So the options are 2.1.10 and 2.2.4.
> I suggest 2.1.10 because it's a chronologically later release, and it
> maximises compatibility with HBase server deployments.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HIVE-24473) Update HBase version to 2.1.10
[ https://issues.apache.org/jira/browse/HIVE-24473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Istvan Toth updated HIVE-24473:
-------------------------------
Attachment: HIVE-24473.patch
Status: Patch Available (was: Open)

> Update HBase version to 2.1.10
> ------------------------------
>
> Key: HIVE-24473
> URL: https://issues.apache.org/jira/browse/HIVE-24473
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Affects Versions: 4.0.0
> Reporter: Istvan Toth
> Assignee: Istvan Toth
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-24473.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hive currently builds with a 2.0.0 pre-release.
> Update HBase to more recent version.
> We cannot use anything later than 2.2.4 because of HBASE-22394
> So the options are 2.1.10 and 2.2.4
> I suggest 2.1.10 because it's a chronologically later release, and it
> maximises compatibility HBase server deployments.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HIVE-24473) Update HBase version to 2.1.10
[ https://issues.apache.org/jira/browse/HIVE-24473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HIVE-24473:
----------------------------------
Labels: pull-request-available (was: )

> Update HBase version to 2.1.10
> ------------------------------
>
> Key: HIVE-24473
> URL: https://issues.apache.org/jira/browse/HIVE-24473
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Affects Versions: 4.0.0
> Reporter: Istvan Toth
> Assignee: Istvan Toth
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hive currently builds with a 2.0.0 pre-release.
> Update HBase to more recent version.
> We cannot use anything later than 2.2.4 because of HBASE-22394
> So the options are 2.1.10 and 2.2.4
> I suggest 2.1.10 because it's a chronologically later release, and it
> maximises compatibility HBase server deployments.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-24473) Update HBase version to 2.1.10
[ https://issues.apache.org/jira/browse/HIVE-24473?focusedWorklogId=519441&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519441 ]

ASF GitHub Bot logged work on HIVE-24473:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 07:16
Start Date: 03/Dec/20 07:16
Worklog Time Spent: 10m

Work Description: stoty opened a new pull request #1729:
URL: https://github.com/apache/hive/pull/1729

### What changes were proposed in this pull request?
Update included HBase version to 2.1.10

### Why are the changes needed?
Currently Hive includes an old pre-release version of HBase. The proposed version is a GA (if older) release with a lot of fixes.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Hive test suite run successfully

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519441)
Remaining Estimate: 0h
Time Spent: 10m

> Update HBase version to 2.1.10
> ------------------------------
>
> Key: HIVE-24473
> URL: https://issues.apache.org/jira/browse/HIVE-24473
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Affects Versions: 4.0.0
> Reporter: Istvan Toth
> Assignee: Istvan Toth
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hive currently builds with a 2.0.0 pre-release.
> Update HBase to more recent version.
> We cannot use anything later than 2.2.4 because of HBASE-22394
> So the options are 2.1.10 and 2.2.4
> I suggest 2.1.10 because it's a chronologically later release, and it
> maximises compatibility HBase server deployments.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Assigned] (HIVE-24473) Update HBase version to 2.1.10
[ https://issues.apache.org/jira/browse/HIVE-24473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Istvan Toth reassigned HIVE-24473:
----------------------------------

> Update HBase version to 2.1.10
> ------------------------------
>
> Key: HIVE-24473
> URL: https://issues.apache.org/jira/browse/HIVE-24473
> Project: Hive
> Issue Type: Improvement
> Components: HBase Handler
> Affects Versions: 4.0.0
> Reporter: Istvan Toth
> Assignee: Istvan Toth
> Priority: Major
>
> Hive currently builds with a 2.0.0 pre-release.
> Update HBase to more recent version.
> We cannot use anything later than 2.2.4 because of HBASE-22394
> So the options are 2.1.10 and 2.2.4
> I suggest 2.1.10 because it's a chronologically later release, and it
> maximises compatibility HBase server deployments.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HIVE-24472) Optimize LlapTaskSchedulerService::preemptTasksFromMap
[ https://issues.apache.org/jira/browse/HIVE-24472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242961#comment-17242961 ]

Rajesh Balamohan commented on HIVE-24472:
-----------------------------------------

Ref: Q14 in tpcds

> Optimize LlapTaskSchedulerService::preemptTasksFromMap
> ------------------------------------------------------
>
> Key: HIVE-24472
> URL: https://issues.apache.org/jira/browse/HIVE-24472
> Project: Hive
> Issue Type: Improvement
> Reporter: Rajesh Balamohan
> Priority: Major
> Attachments: Screenshot 2020-12-03 at 12.13.03 PM.png
>
>
> !Screenshot 2020-12-03 at 12.13.03 PM.png|width=1063,height=571!
> speculativeTasks could possibly include node information to reduce CPU burn
> in LlapTaskSchedulerService::preemptTasksFromMap

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
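A rough illustration of the optimization hinted at in the issue: key the speculative-task bookkeeping by node so that preempting tasks for one host scans only that host's tasks rather than every speculative task. All names here are hypothetical stand-ins, not the actual `LlapTaskSchedulerService` fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: speculative tasks tracked per host, so a
// preemption pass for one node never touches the other nodes' tasks.
public class PreemptSketch {

    // host -> task ids speculated on that host
    private final Map<String, List<String>> speculativeTasksByNode = new HashMap<>();

    void addSpeculativeTask(String host, String taskId) {
        speculativeTasksByNode.computeIfAbsent(host, h -> new ArrayList<>()).add(taskId);
    }

    // O(tasks on this host) instead of O(all speculative tasks).
    List<String> preemptTasksOnNode(String host) {
        List<String> victims = speculativeTasksByNode.remove(host);
        return victims == null ? List.of() : victims;
    }

    public static void main(String[] args) {
        PreemptSketch s = new PreemptSketch();
        s.addSpeculativeTask("node1", "attempt_1");
        s.addSpeculativeTask("node2", "attempt_2");
        System.out.println(s.preemptTasksOnNode("node1")); // [attempt_1]
    }
}
```

The design choice is the usual flat-scan-vs-index trade-off: a per-node map costs a little extra bookkeeping on insert but makes the hot preemption path proportional to one node's task count.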
[jira] [Assigned] (HIVE-24471) Add support for combiner in hash mode group aggregation
[ https://issues.apache.org/jira/browse/HIVE-24471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mahesh kumar behera reassigned HIVE-24471:
------------------------------------------

> Add support for combiner in hash mode group aggregation
> -------------------------------------------------------
>
> Key: HIVE-24471
> URL: https://issues.apache.org/jira/browse/HIVE-24471
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Reporter: mahesh kumar behera
> Assignee: mahesh kumar behera
> Priority: Major
>
> In map-side group aggregation, a partial grouped aggregation is calculated to
> reduce the data written to disk by the map task. In the case of hash
> aggregation, where the input data is not sorted, a hash table is used. If the
> hash table size increases beyond a configurable limit, data is flushed to
> disk and a new hash table is generated. If the reduction achieved by the hash
> table is less than the minimum hash aggregation reduction calculated during
> compile time, the map-side aggregation is converted to streaming mode. So if
> the first few batches of records do not result in a significant reduction,
> the mode is switched to streaming mode. This may hurt performance if the
> subsequent batches of records have fewer distinct values. To mitigate this
> situation, a combiner can be added to the map task after the keys are sorted.
> This makes sure that the aggregation is done where possible and reduces the
> data written to disk.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
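The combiner idea in the description can be sketched as follows. Assuming map output already sorted by key, adjacent rows with equal keys are merged before anything is written to disk, which recovers the reduction even after the hash-table stage has fallen back to streaming mode. The class and method names are illustrative, not Hive's actual operator API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative combiner over key-sorted map output: because equal keys
// are adjacent after the sort, a single forward pass can fold
// (key, partialCount) rows together with O(1) extra state per group.
public class CombinerSketch {

    static List<Map.Entry<String, Long>> combine(List<Map.Entry<String, Long>> sortedRun) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : sortedRun) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).getKey().equals(e.getKey())) {
                // same key as the previous row: fold into one partial aggregate
                out.set(last, Map.entry(e.getKey(), out.get(last).getValue() + e.getValue()));
            } else {
                out.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> run =
            List.of(Map.entry("a", 1L), Map.entry("a", 2L), Map.entry("b", 1L));
        System.out.println(combine(run)); // [a=3, b=1]
    }
}
```

Unlike the hash table, this pass needs no size limit and never "gives up", which is why running it after the sort helps exactly the case the issue describes.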
[jira] [Work logged] (HIVE-24468) Use Event Time instead of Current Time in Notification Log DB Entry
[ https://issues.apache.org/jira/browse/HIVE-24468?focusedWorklogId=519396&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519396 ]

ASF GitHub Bot logged work on HIVE-24468:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 03/Dec/20 05:39
Start Date: 03/Dec/20 05:39
Worklog Time Spent: 10m

Work Description: pvary commented on pull request #1728:
URL: https://github.com/apache/hive/pull/1728#issuecomment-737678785

Could this cause out-of-order notification timestamps?
If we are sure that nobody relies on timestamps to check notification order (bad practice), then we can change it, but I would be cautious about changing this, as notifications are a widely used API.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519396)
Time Spent: 40m (was: 0.5h)

> Use Event Time instead of Current Time in Notification Log DB Entry
> -------------------------------------------------------------------
>
> Key: HIVE-24468
> URL: https://issues.apache.org/jira/browse/HIVE-24468
> Project: Hive
> Issue Type: Improvement
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Assigned] (HIVE-24470) Separate HiveMetastore Thrift and Driver logic
[ https://issues.apache.org/jira/browse/HIVE-24470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cameron Moberg reassigned HIVE-24470:
-------------------------------------

> Separate HiveMetastore Thrift and Driver logic
> ----------------------------------------------
>
> Key: HIVE-24470
> URL: https://issues.apache.org/jira/browse/HIVE-24470
> Project: Hive
> Issue Type: Improvement
> Components: Standalone Metastore
> Reporter: Cameron Moberg
> Assignee: Cameron Moberg
> Priority: Minor
>
> In the file HiveMetastore.java the majority of the code is a Thrift interface
> rather than the actual logic behind starting the Hive metastore; this
> interface should be moved out into a separate file to clean up the file.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-24397) Add the projection specification to the table request object and add placeholders in ObjectStore.java
[ https://issues.apache.org/jira/browse/HIVE-24397?focusedWorklogId=519254&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519254 ]

ASF GitHub Bot logged work on HIVE-24397:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 02/Dec/20 21:24
Start Date: 02/Dec/20 21:24
Worklog Time Spent: 10m

Work Description: vnhive commented on a change in pull request #1681:
URL: https://github.com/apache/hive/pull/1681#discussion_r530816743

## File path: standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/client/TestTablesGetExists.java

## @@ -402,6 +393,134 @@ public void testGetTableObjectsByName() throws Exception {
   }

+  @Test
+  public void testGetTableObjectsWithProjectionOfSingleField() throws Exception {
+    List<String> tableNames = new ArrayList<>();
+    tableNames.add(testTables[0].getTableName());
+    tableNames.add(testTables[1].getTableName());
+
+    GetTablesRequest request = new GetTablesRequest();
+    request.setProjectionSpec(new GetProjectionsSpec());
+    request.setTblNames(tableNames);
+    request.setDbName(DEFAULT_DATABASE);
+
+    GetProjectionsSpec projectSpec = request.getProjectionSpec();
+    List<String> projectedFields = Collections.singletonList("sd.location");
+    projectSpec.setFieldList(projectedFields);
+
+    List<Table> tables = client.getTableObjectsByRequest(request);
+
+    Assert.assertEquals("Found tables", 2, tables.size());
+
+    for (Table table : tables) {
+      Assert.assertFalse(table.isSetDbName());
+      Assert.assertFalse(table.isSetCatName());
+      Assert.assertFalse(table.isSetTableName());
+      Assert.assertTrue(table.isSetSd());
+    }
+  }
+
+  @Test
+  public void testGetTableObjectsWithNullProjectionSpec() throws Exception {
+    List<String> tableNames = new ArrayList<>();
+    tableNames.add(testTables[0].getTableName());
+    tableNames.add(testTables[1].getTableName());
+
+    GetTablesRequest request = new GetTablesRequest();
+    request.setProjectionSpec(null);
+    request.setTblNames(tableNames);
+    request.setDbName(DEFAULT_DATABASE);
+
+    List<Table> tables = client.getTableObjectsByRequest(request);
+
+    Assert.assertEquals("Found tables", 2, tables.size());
+  }
+
+  @Test
+  public void testGetTableObjectsWithNonExistentColumn() throws Exception {
+    List<String> tableNames = new ArrayList<>();
+    tableNames.add(testTables[0].getTableName());
+    tableNames.add(testTables[1].getTableName());
+
+    GetTablesRequest request = new GetTablesRequest();
+    request.setProjectionSpec(new GetProjectionsSpec());
+    request.setTblNames(tableNames);
+    request.setDbName(DEFAULT_DATABASE);
+
+    GetProjectionsSpec projectSpec = request.getProjectionSpec();
+    List<String> projectedFields = Arrays.asList("Invalid1");
+    projectSpec.setFieldList(projectedFields);
+
+    Assert.assertThrows(Exception.class, () -> client.getTableObjectsByRequest(request));
+  }
+
+  @Test
+  public void testGetTableObjectsWithNonExistentColumns() throws Exception {
+    List<String> tableNames = new ArrayList<>();
+    tableNames.add(testTables[0].getTableName());
+    tableNames.add(testTables[1].getTableName());
+
+    GetTablesRequest request = new GetTablesRequest();
+    request.setProjectionSpec(new GetProjectionsSpec());
+    request.setTblNames(tableNames);
+    request.setDbName(DEFAULT_DATABASE);
+
+    GetProjectionsSpec projectSpec = request.getProjectionSpec();
+    List<String> projectedFields = Arrays.asList("Invalid1", "Invalid2");
+    projectSpec.setFieldList(projectedFields);
+
+    Assert.assertThrows(Exception.class, () -> client.getTableObjectsByRequest(request));
+  }
+
+  @Test
+  public void testGetTableObjectsWithEmptyProjection() throws Exception {
+    List<String> tableNames = new ArrayList<>();
+    tableNames.add(testTables[0].getTableName());
+    tableNames.add(testTables[1].getTableName());
+
+    GetTablesRequest request = new GetTablesRequest();
+    request.setProjectionSpec(new GetProjectionsSpec());
+    request.setTblNames(tableNames);
+    request.setDbName(DEFAULT_DATABASE);
+
+    GetProjectionsSpec projectSpec = request.getProjectionSpec();
+    List<String> projectedFields = Arrays.asList();
+    projectSpec.setFieldList(projectedFields);
+
+    List<Table> tables = client.getTableObjectsByRequest(request);
+
+    Assert.assertEquals("Found tables", 0, tables.size());
+  }
+
+  @Test
+  public void testGetTableObjectsWithProjectionOfMultipleField() throws Exception {
+    List<String> tableNames = new ArrayList<>();
+    tableNames.add(testTables[0].getTableName());
+    tableNames.add(testTables[1].getTableName());
+
+    GetTablesRequest request = new GetTablesRequest();
+    request.setProjectionSpec(new GetProjectionsSpec());
+    request.setTblNames(tableNames);
+    request.setDbName(DEFAULT_DATABASE);
+
+    GetProjectionsSpec projectSpec = request.getProjectionSpec();
+    List<String> projectedFields = Arrays.asList("database",
[jira] [Work logged] (HIVE-24468) Use Event Time instead of Current Time in Notification Log DB Entry
[ https://issues.apache.org/jira/browse/HIVE-24468?focusedWorklogId=519219&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519219 ]

ASF GitHub Bot logged work on HIVE-24468:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 02/Dec/20 20:12
Start Date: 02/Dec/20 20:12
Worklog Time Spent: 10m

Work Description: belugabehr opened a new pull request #1728:
URL: https://github.com/apache/hive/pull/1728

…g DB Entry

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519219)
Time Spent: 0.5h (was: 20m)

> Use Event Time instead of Current Time in Notification Log DB Entry
> -------------------------------------------------------------------
>
> Key: HIVE-24468
> URL: https://issues.apache.org/jira/browse/HIVE-24468
> Project: Hive
> Issue Type: Improvement
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-24468) Use Event Time instead of Current Time in Notification Log DB Entry
[ https://issues.apache.org/jira/browse/HIVE-24468?focusedWorklogId=519218&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519218 ]

ASF GitHub Bot logged work on HIVE-24468:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 02/Dec/20 20:06
Start Date: 02/Dec/20 20:06
Worklog Time Spent: 10m

Work Description: belugabehr closed pull request #1728:
URL: https://github.com/apache/hive/pull/1728

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519218)
Time Spent: 20m (was: 10m)

> Use Event Time instead of Current Time in Notification Log DB Entry
> -------------------------------------------------------------------
>
> Key: HIVE-24468
> URL: https://issues.apache.org/jira/browse/HIVE-24468
> Project: Hive
> Issue Type: Improvement
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519211&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519211 ]

ASF GitHub Bot logged work on HIVE-24444:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 02/Dec/20 19:45
Start Date: 02/Dec/20 19:45
Worklog Time Spent: 10m

Work Description: deniskuzZ commented on a change in pull request #1716:
URL: https://github.com/apache/hive/pull/1716#discussion_r534377951

## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java

## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    // save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+        Ref.from(false), false, dirSnapshots);
+    return dir.getObsolete().size();

Review comment: consider Case 2: how is the situation going to change if, instead of aborted txns, we have successful ones? Imagine writes arriving continuously via streaming. Won't new deltas prevent us from cleaning up the obsolete ones and marking the cleanup operation as completed for the corresponding compaction, and rather pile up cleanup requests in a queue?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519211)
Time Spent: 5h (was: 4h 50m)

> compactor.Cleaner should not set state "mark cleaned" if there are obsolete
> files in the FS
> ---------------------------------------------------------------------------
>
> Key: HIVE-24444
> URL: https://issues.apache.org/jira/browse/HIVE-24444
> Project: Hive
> Issue Type: Bug
> Reporter: Karen Coppage
> Assignee: Karen Coppage
> Priority: Major
> Labels: pull-request-available
> Time Spent: 5h
> Remaining Estimate: 0h
>
> This is an improvement on HIVE-24314, in which markCleaned() is called only
> if +any+ files are deleted by the cleaner.
> This could cause a problem in the following case:
> Say for table_1, compaction1's cleaning was blocked by an open txn, and
> compaction is run again on the same table (compaction2). Both compaction1 and
> compaction2 could be in "ready for cleaning" at the same time. By this time
> the blocking open txn could be committed. When the cleaner runs, one of
> compaction1 and compaction2 will remain in the "ready for cleaning" state:
> Say compaction2 is picked up by the cleaner first. The Cleaner deletes all
> obsolete files. Then compaction1 is picked up by the cleaner; the cleaner
> doesn't remove any files and compaction1 will stay in the queue in a "ready
> for cleaning" state.
> HIVE-24291 already solves this issue, but if it isn't usable (for example if
> HMS schema changes are out of the question) then HIVE-24314 + this change will
> fix the issue of the Cleaner not removing all obsolete files.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-24403) change min_history_level schema change to be compatible with previous version
[ https://issues.apache.org/jira/browse/HIVE-24403?focusedWorklogId=519182&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519182 ]

ASF GitHub Bot logged work on HIVE-24403:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 02/Dec/20 18:22
Start Date: 02/Dec/20 18:22
Worklog Time Spent: 10m

Work Description: pvargacl commented on a change in pull request #1688:
URL: https://github.com/apache/hive/pull/1688#discussion_r534385104

## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/DatabaseProduct.java

## @@ -186,6 +186,19 @@ public boolean isDeadlock(SQLException e) {
         || e.getMessage().contains("can't serialize access for this transaction"));
   }

+  /**
+   * Is the given exception a table not found exception
+   * @param e Exception
+   * @return
+   */
+  public boolean isTableNotExists(SQLException e) {

Review comment: Done

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519182)
Time Spent: 4h (was: 3h 50m)

> change min_history_level schema change to be compatible with previous version
> -----------------------------------------------------------------------------
>
> Key: HIVE-24403
> URL: https://issues.apache.org/jira/browse/HIVE-24403
> Project: Hive
> Issue Type: Improvement
> Components: Metastore
> Reporter: Peter Varga
> Assignee: Peter Varga
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4h
> Remaining Estimate: 0h
>
> In some configurations the HMS backend DB is used by HMS services with
> different versions.
> HIVE-23107 dropped the min_history_level table from the backend DB, making
> the new schema version incompatible with the older HMS services.
> It is possible to modify that change to keep the compatibility:
> * Keep the min_history_level table
> * Add the new fields for the compaction_queue the same way
> * Create a feature flag for min_history_level, and if it is on:
> * Keep the logic inserting to the table during openTxn
> * Keep the logic removing the records at commitTxn and abortTxn
> * Change the logic in the cleaner to get the highwatermark the old way
> * But still change it to not start the cleaning before that
> * The txn_to_write_id table cleaning can work the new way in the new version
> and the old way in the old version
> * This feature flag can be automatically set up based on the existence of the
> min_history_level table; this way, if the table is dropped, all HMSs can
> switch to the new functionality without a restart

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (HIVE-24432) Delete Notification Events in Batches
[ https://issues.apache.org/jira/browse/HIVE-24432?focusedWorklogId=519180&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519180 ]

ASF GitHub Bot logged work on HIVE-24432:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 02/Dec/20 18:20
Start Date: 02/Dec/20 18:20
Worklog Time Spent: 10m

Work Description: belugabehr opened a new pull request #1710:
URL: https://github.com/apache/hive/pull/1710

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 519180)
Time Spent: 50m (was: 40m)

> Delete Notification Events in Batches
> -------------------------------------
>
> Key: HIVE-24432
> URL: https://issues.apache.org/jira/browse/HIVE-24432
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 3.2.0
> Reporter: David Mollitor
> Assignee: David Mollitor
> Priority: Major
> Labels: pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Notification events are loaded in batches (reduces memory pressure on the
> HMS), but all of the deletes happen under a single transaction and, when
> deleting many records, can put a lot of pressure on the backend database.
> Instead, delete events in batches (in different transactions) as well.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
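The batching described in the issue can be sketched as a simple id-chunking helper. `BatchDeleteSketch` and `toBatches` are hypothetical names, not the actual ObjectStore code; in the real change each chunk's DELETE would run and commit in its own transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split expired notification-event ids into
// fixed-size chunks so each chunk can be deleted (and committed)
// separately instead of in one huge delete.
public class BatchDeleteSketch {

    static List<List<Long>> toBatches(List<Long> eventIds, int batchSize) {
        List<List<Long>> batches = new ArrayList<>();
        for (int i = 0; i < eventIds.size(); i += batchSize) {
            // copy the view so each batch outlives the source list
            batches.add(new ArrayList<>(eventIds.subList(i, Math.min(i + batchSize, eventIds.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Each inner list would back one DELETE ... WHERE EVENT_ID IN (...)
        // executed in its own transaction.
        System.out.println(toBatches(List.of(1L, 2L, 3L, 4L, 5L), 2)); // [[1, 2], [3, 4], [5]]
    }
}
```

The trade-off is the usual one for batched deletes: shorter transactions and smaller undo/lock footprints on the backend database, at the cost of the overall cleanup no longer being atomic.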
[jira] [Work logged] (HIVE-24403) change min_history_level schema change to be compatible with previous version
[ https://issues.apache.org/jira/browse/HIVE-24403?focusedWorklogId=519178=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519178 ] ASF GitHub Bot logged work on HIVE-24403: - Author: ASF GitHub Bot Created on: 02/Dec/20 18:19 Start Date: 02/Dec/20 18:19 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1688: URL: https://github.com/apache/hive/pull/1688#discussion_r534383384 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnHandler.java ## @@ -5094,6 +5153,99 @@ public void countOpenTxns() throws MetaException { } } + /** + * Add min history level entry for each generated txn record + * @param dbConn Connection + * @param txnIds new transaction ids + * @deprecated Remove this method when min_history_level table is dropped + * @throws SQLException ex + */ + @Deprecated + private void addTxnToMinHistoryLevel(Connection dbConn, List txnIds, long minOpenTxnId) throws SQLException { +if (!useMinHistoryLevel) { + return; +} +// Need to register minimum open txnid for current transactions into MIN_HISTORY table. +try (Statement stmt = dbConn.createStatement()) { + + List rows = txnIds.stream().map(txnId -> txnId + ", " + minOpenTxnId).collect(Collectors.toList()); + + // Insert transaction entries into MIN_HISTORY_LEVEL. 
+ List<String> inserts = + sqlGenerator.createInsertValuesStmt("\"MIN_HISTORY_LEVEL\" (\"MHL_TXNID\", \"MHL_MIN_OPEN_TXNID\")", rows); + for (String insert : inserts) { +LOG.debug("Going to execute insert <" + insert + ">"); +stmt.execute(insert); + } + LOG.info("Added entries to MIN_HISTORY_LEVEL for current txns: (" + txnIds + ") with min_open_txn: " + minOpenTxnId); +} catch (SQLException e) { + if (dbProduct.isTableNotExists(e)) { +// If the table does not exist anymore, we disable the flag and start to work the new way +// This enables switching to the new functionality without a restart +useMinHistoryLevel = false; Review comment: The idea is that multiple HMS instances use the same backend db: you upgrade them one by one, the last one changes the schema, and all the others switch to the new functionality after their first call to the min_history table. Do you have any practical example of wrapping them in an aspect? I do not want to add much more code complexity just to avoid checking an exception in four places. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519178) Time Spent: 3h 50m (was: 3h 40m) > change min_history_level schema change to be compatible with previous version > - > > Key: HIVE-24403 > URL: https://issues.apache.org/jira/browse/HIVE-24403 > Project: Hive > Issue Type: Improvement > Components: Metastore >Reporter: Peter Varga >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Time Spent: 3h 50m > Remaining Estimate: 0h > > In some configurations the HMS backend DB is used by HMS services with > different versions. > HIVE-23107 dropped the min_history_level table from the backend DB making > the new schema version incompatible with the older HMS services. 
> It is possible to modify that change to keep the compatibility > * Keep the min_history_level table > * Add the new fields for the compaction_queue the same way > * Create a feature flag for min_history_level and if it is on > * Keep the logic inserting to the table during openTxn > * Keep the logic removing the records at commitTxn and abortTxn > * Change the logic in the cleaner, to get the highwatermark the old way > * But still change it to not start the cleaning before that > * The txn_to_write_id table cleaning can work the new way in the new version > and the old way in the old version > * This feature flag can be automatically setup based on the existence of the > min_history level table, this way if the table will be dropped all HMS-s can > switch to the new functionality without restart -- This message was sent by Atlassian Jira (v8.3.4#803005)
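The fallback behavior in the plan above (automatically switching off the min_history_level logic once the table is dropped, with no restart) can be sketched roughly as follows. `MinHistoryFallback`, `SqlAction`, and the message-based table-missing check are hypothetical stand-ins for the real TxnHandler and JDBC plumbing, not Hive code.

```java
// Illustrative sketch of the fallback pattern from the review thread above:
// keep writing to MIN_HISTORY_LEVEL while the table exists, and permanently
// flip the useMinHistoryLevel flag the first time the backend reports that
// the table is gone (i.e. another HMS instance already upgraded the schema).
// MinHistoryFallback, SqlAction and the message-based check are hypothetical
// stand-ins for the real TxnHandler / DatabaseProduct code.
public class MinHistoryFallback {

    interface SqlAction {
        void run() throws Exception;
    }

    private volatile boolean useMinHistoryLevel = true;

    // Stand-in for a vendor-specific "table not exists" check; the real code
    // would inspect SQL error codes rather than the message text.
    private boolean isTableNotExists(Exception e) {
        return e.getMessage() != null && e.getMessage().contains("no such table");
    }

    // Returns true if the insert ran, false if the feature is (now) disabled.
    public boolean insertMinHistory(SqlAction insert) throws Exception {
        if (!useMinHistoryLevel) {
            return false;
        }
        try {
            insert.run();
            return true;
        } catch (Exception e) {
            if (isTableNotExists(e)) {
                // Table was dropped by an upgraded HMS: switch to the new
                // behavior without a restart.
                useMinHistoryLevel = false;
                return false;
            }
            throw e;
        }
    }
}
```

The point of the pattern is that the "table not exists" exception is handled at each call site instead of failing the transaction, which is what the review thread's aspect discussion is about.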
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519172&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519172 ] ASF GitHub Bot logged work on HIVE-24444: - Author: ASF GitHub Bot Created on: 02/Dec/20 18:11 Start Date: 02/Dec/20 18:11 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1716: URL: https://github.com/apache/hive/pull/1716#discussion_r534377951 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java ## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa } fs.delete(dead, true); } -return true; +// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this +// number reaches 0. +return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0; + } + + /** + * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning + * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to + * finish, e.g. because of an open transaction at the time of compaction. + * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are + * obsolete directories, then the Cleaner has more work to do. 
+ * @param location location of table + * @return number of dirs left for the cleaner to clean – eventually + * @throws IOException + */ + private int getNumEventuallyObsoleteDirs(String location, Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots) + throws IOException { +ValidTxnList validTxnList = new ValidReadTxnList(); +//save it so that getAcidState() sees it +conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString()); +ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList(); +Path locPath = new Path(location); +AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList, +Ref.from(false), false, dirSnapshots); +return dir.getObsolete().size(); Review comment: considering Case 2: how is the situation going to change if instead of an aborted txn you have a successful write? Writes coming in continuously (streaming). Won't a new delta prevent us from marking the clean as complete for the corresponding compaction request, and we'll just have piled up cleanup requests in a queue? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519172) Time Spent: 4h 50m (was: 4h 40m) > compactor.Cleaner should not set state "mark cleaned" if there are obsolete > files in the FS > --- > > Key: HIVE-24444 > URL: https://issues.apache.org/jira/browse/HIVE-24444 > Project: Hive > Issue Type: Bug >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Time Spent: 4h 50m > Remaining Estimate: 0h > > This is an improvement on HIVE-24314, in which markCleaned() is called only > if +any+ files are deleted by the cleaner. 
This could cause a problem in the > following case: > Say for table_1 compaction1 cleaning was blocked by an open txn, and > compaction is run again on the same table (compaction2). Both compaction1 and > compaction2 could be in "ready for cleaning" at the same time. By this time > the blocking open txn could be committed. When the cleaner runs, one of > compaction1 and compaction2 will remain in the "ready for cleaning" state: > Say compaction2 is picked up by the cleaner first. The Cleaner deletes all > obsolete files. Then compaction1 is picked up by the cleaner; the cleaner > doesn't remove any files and compaction1 will stay in the queue in a "ready > for cleaning" state. > HIVE-24291 already solves this issue but if it isn't usable (for example if > HMS schema changes are out of the question) then HIVE-24314 + this change will > fix the issue of the Cleaner not removing all obsolete files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
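The difference between the HIVE-24314 rule and this change, as applied to the two-compaction scenario above, can be reduced to two predicates. This is a hypothetical model for illustration, not the actual Cleaner code; `CleanerRules` is an invented name.

```java
// Hypothetical model of the scenario in the description: compaction1 and
// compaction2 are both "ready for cleaning"; the cleaner handles compaction2
// first and deletes all obsolete dirs, then handles compaction1 and deletes
// nothing. The two rules below contrast HIVE-24314 ("mark cleaned only if any
// file was deleted") with this change ("mark cleaned once no eventually
// obsolete dirs remain").
public class CleanerRules {

    // HIVE-24314 rule: an entry that deleted nothing stays "ready for cleaning".
    public static boolean markCleanedIfAnyDeleted(int filesDeleted) {
        return filesDeleted > 0;
    }

    // This change: an entry is done once getAcidState (assuming no open txns)
    // reports zero obsolete base/delta directories left on the filesystem.
    public static boolean markCleanedIfNoneLeft(int eventuallyObsoleteDirs) {
        return eventuallyObsoleteDirs == 0;
    }
}
```

Under the first rule compaction1 (which deleted nothing) is stuck in the queue; under the second, both entries can be marked cleaned once the filesystem holds no more eventually-obsolete directories.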
[jira] [Work logged] (HIVE-24432) Delete Notification Events in Batches
[ https://issues.apache.org/jira/browse/HIVE-24432?focusedWorklogId=519168=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519168 ] ASF GitHub Bot logged work on HIVE-24432: - Author: ASF GitHub Bot Created on: 02/Dec/20 18:02 Start Date: 02/Dec/20 18:02 Worklog Time Spent: 10m Work Description: belugabehr closed pull request #1710: URL: https://github.com/apache/hive/pull/1710 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519168) Time Spent: 40m (was: 0.5h) > Delete Notification Events in Batches > - > > Key: HIVE-24432 > URL: https://issues.apache.org/jira/browse/HIVE-24432 > Project: Hive > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Notification events are loaded in batches (reduces memory pressure on the > HMS), but all of the deletes happen under a single transactions and, when > deleting many records, can put a lot of pressure on the backend database. > Instead, delete events in batches (in different transactions) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24403) change min_history_level schema change to be compatible with previous version
[ https://issues.apache.org/jira/browse/HIVE-24403?focusedWorklogId=519162=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519162 ] ASF GitHub Bot logged work on HIVE-24403: - Author: ASF GitHub Bot Created on: 02/Dec/20 17:53 Start Date: 02/Dec/20 17:53 Worklog Time Spent: 10m Work Description: deniskuzZ commented on a change in pull request #1688: URL: https://github.com/apache/hive/pull/1688#discussion_r534366541 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnDbUtil.java ## @@ -385,6 +391,26 @@ public static String queryToString(Configuration conf, String query, boolean inc return sb.toString(); } + /** + * This is only for testing, it does not use the connectionPool from TxnHandler! + * @param conf + * @param query + * @throws Exception + */ + @VisibleForTesting + public static void executeUpdate(Configuration conf, String query) Review comment: in this case we should consider refactoring this class This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519162) Time Spent: 3h 40m (was: 3.5h) > change min_history_level schema change to be compatible with previous version > - > > Key: HIVE-24403 > URL: https://issues.apache.org/jira/browse/HIVE-24403 > Project: Hive > Issue Type: Improvement > Components: Metastore >Reporter: Peter Varga >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Time Spent: 3h 40m > Remaining Estimate: 0h > > In some configurations the HMS backend DB is used by HMS services with > different versions. > HIVE-23107 dropped the min_history_level table from the backend DB making > the new schema version incompatible with the older HMS services. 
> It is possible to modify that change to keep the compatibility > * Keep the min_history_level table > * Add the new fields for the compaction_queue the same way > * Create a feature flag for min_history_level and if it is on > * Keep the logic inserting to the table during openTxn > * Keep the logic removing the records at commitTxn and abortTxn > * Change the logic in the cleaner, to get the highwatermark the old way > * But still change it to not start the cleaning before that > * The txn_to_write_id table cleaning can work the new way in the new version > and the old way in the old version > * This feature flag can be automatically setup based on the existence of the > min_history level table, this way if the table will be dropped all HMS-s can > switch to the new functionality without restart -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24403) change min_history_level schema change to be compatible with previous version
[ https://issues.apache.org/jira/browse/HIVE-24403?focusedWorklogId=519160&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519160 ] ASF GitHub Bot logged work on HIVE-24403: - Author: ASF GitHub Bot Created on: 02/Dec/20 17:52 Start Date: 02/Dec/20 17:52 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1688: URL: https://github.com/apache/hive/pull/1688#discussion_r534365714 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnHandler.java ## @@ -670,6 +725,8 @@ public OpenTxnsResponse openTxns(OpenTxnRequest rqst) throws MetaException { assert txnIds.size() == numTxns; + addTxnToMinHistoryLevel(dbConn, txnIds, minOpenTxnId); Review comment: That was my first intent, but it resulted in a lock timeout: after inserting the new records into the txns table, the min open select was not running. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519160) Time Spent: 3.5h (was: 3h 20m) > change min_history_level schema change to be compatible with previous version > - > > Key: HIVE-24403 > URL: https://issues.apache.org/jira/browse/HIVE-24403 > Project: Hive > Issue Type: Improvement > Components: Metastore >Reporter: Peter Varga >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Time Spent: 3.5h > Remaining Estimate: 0h > > In some configurations the HMS backend DB is used by HMS services with > different versions. > HIVE-23107 dropped the min_history_level table from the backend DB making > the new schema version incompatible with the older HMS services. 
> It is possible to modify that change to keep the compatibility > * Keep the min_history_level table > * Add the new fields for the compaction_queue the same way > * Create a feature flag for min_history_level and if it is on > * Keep the logic inserting to the table during openTxn > * Keep the logic removing the records at commitTxn and abortTxn > * Change the logic in the cleaner, to get the highwatermark the old way > * But still change it to not start the cleaning before that > * The txn_to_write_id table cleaning can work the new way in the new version > and the old way in the old version > * This feature flag can be automatically setup based on the existence of the > min_history level table, this way if the table will be dropped all HMS-s can > switch to the new functionality without restart -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24403) change min_history_level schema change to be compatible with previous version
[ https://issues.apache.org/jira/browse/HIVE-24403?focusedWorklogId=519148=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519148 ] ASF GitHub Bot logged work on HIVE-24403: - Author: ASF GitHub Bot Created on: 02/Dec/20 17:40 Start Date: 02/Dec/20 17:40 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1688: URL: https://github.com/apache/hive/pull/1688#discussion_r534357567 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnHandler.java ## @@ -390,6 +404,42 @@ public void setConf(Configuration conf){ } } + /** + * Check if min_history_level table is usable + * @return + * @throws MetaException + */ + private boolean checkMinHistoryLevelTable(boolean configValue) throws MetaException { +if (!configValue) { + // don't check it if disabled + return false; +} +Connection dbConn = null; +boolean tableExists = true; +try { + dbConn = getDbConn(Connection.TRANSACTION_READ_COMMITTED); + try (Statement stmt = dbConn.createStatement()) { +// Dummy query to see if table exists +try (ResultSet rs = stmt.executeQuery("SELECT MIN(\"MHL_MIN_OPEN_TXNID\") FROM \"MIN_HISTORY_LEVEL\"")) { Review comment: fixed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519148) Time Spent: 3h 20m (was: 3h 10m) > change min_history_level schema change to be compatible with previous version > - > > Key: HIVE-24403 > URL: https://issues.apache.org/jira/browse/HIVE-24403 > Project: Hive > Issue Type: Improvement > Components: Metastore >Reporter: Peter Varga >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > In some configurations the HMS backend DB is used by HMS services with > different versions. > HIVE-23107 dropped the min_history_level table from the backend DB making > the new schema version incompatible with the older HMS services. > It is possible to modify that change to keep the compatibility > * Keep the min_history_level table > * Add the new fields for the compaction_queue the same way > * Create a feature flag for min_history_level and if it is on > * Keep the logic inserting to the table during openTxn > * Keep the logic removing the records at commitTxn and abortTxn > * Change the logic in the cleaner, to get the highwatermark the old way > * But still change it to not start the cleaning before that > * The txn_to_write_id table cleaning can work the new way in the new version > and the old way in the old version > * This feature flag can be automatically setup based on the existence of the > min_history level table, this way if the table will be dropped all HMS-s can > switch to the new functionality without restart -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24403) change min_history_level schema change to be compatible with previous version
[ https://issues.apache.org/jira/browse/HIVE-24403?focusedWorklogId=519146&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519146 ] ASF GitHub Bot logged work on HIVE-24403: - Author: ASF GitHub Bot Created on: 02/Dec/20 17:39 Start Date: 02/Dec/20 17:39 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1688: URL: https://github.com/apache/hive/pull/1688#discussion_r534357150 ## File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/txn/TxnDbUtil.java ## @@ -385,6 +391,26 @@ public static String queryToString(Configuration conf, String query, boolean inc return sb.toString(); } + /** + * This is only for testing, it does not use the connectionPool from TxnHandler! + * @param conf + * @param query + * @throws Exception + */ + @VisibleForTesting + public static void executeUpdate(Configuration conf, String query) Review comment: Well, this class is a test utility. /** * Utility methods for creating and destroying txn database/schema, plus methods for * querying against metastore tables. * Placed here in a separate class so it can be shared across unit tests. */ public final class TxnDbUtil The problem is more that getEpochFn and executeQueriesInBatchNoCount were added to this class; those are production code. I know it would be nicer if it were in a test package, but then it would be harder to use in 5 different projects. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519146) Time Spent: 3h 10m (was: 3h) > change min_history_level schema change to be compatible with previous version > - > > Key: HIVE-24403 > URL: https://issues.apache.org/jira/browse/HIVE-24403 > Project: Hive > Issue Type: Improvement > Components: Metastore >Reporter: Peter Varga >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Time Spent: 3h 10m > Remaining Estimate: 0h > > In some configurations the HMS backend DB is used by HMS services with > different versions. > HIVE-23107 dropped the min_history_level table from the backend DB making > the new schema version incompatible with the older HMS services. > It is possible to modify that change to keep the compatibility > * Keep the min_history_level table > * Add the new fields for the compaction_queue the same way > * Create a feature flag for min_history_level and if it is on > * Keep the logic inserting to the table during openTxn > * Keep the logic removing the records at commitTxn and abortTxn > * Change the logic in the cleaner, to get the highwatermark the old way > * But still change it to not start the cleaning before that > * The txn_to_write_id table cleaning can work the new way in the new version > and the old way in the old version > * This feature flag can be automatically setup based on the existence of the > min_history level table, this way if the table will be dropped all HMS-s can > switch to the new functionality without restart -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519128&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519128 ] ASF GitHub Bot logged work on HIVE-24444: - Author: ASF GitHub Bot Created on: 02/Dec/20 17:19 Start Date: 02/Dec/20 17:19 Worklog Time Spent: 10m Work Description: pvargacl commented on a change in pull request #1716: URL: https://github.com/apache/hive/pull/1716#discussion_r534342957 ## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java ## @@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa } fs.delete(dead, true); } -return true; +// Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this +// number reaches 0. +return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0; + } + + /** + * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning + * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to + * finish, e.g. because of an open transaction at the time of compaction. + * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are + * obsolete directories, then the Cleaner has more work to do. 
+ * @param location location of table + * @return number of dirs left for the cleaner to clean – eventually + * @throws IOException + */ + private int getNumEventuallyObsoleteDirs(String location, Map<Path, AcidUtils.HdfsDirSnapshot> dirSnapshots) + throws IOException { +ValidTxnList validTxnList = new ValidReadTxnList(); +//save it so that getAcidState() sees it +conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString()); +ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList(); +Path locPath = new Path(location); +AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList, +Ref.from(false), false, dirSnapshots); +return dir.getObsolete().size(); Review comment: Case 1: If HIVE-23107 and the following are there, I think none of these checks are necessary, because we can be sure that the Cleaner was running when it could delete everything it could. Also, if delayed cleaning is enabled, it is guaranteed that it will never delete any more obsolete directories no matter how many times it is running (see: validWriteIdList.updateHighWatermark(ci.highestWriteId)). If we must choose, checking if anything was removed does less damage. Case 2: If those fixes are not there, I think checking for obsolete files is better than checking if anything was removed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519128) Time Spent: 4h 40m (was: 4.5h) > compactor.Cleaner should not set state "mark cleaned" if there are obsolete > files in the FS > --- > > Key: HIVE-24444 > URL: https://issues.apache.org/jira/browse/HIVE-24444 > Project: Hive > Issue Type: Bug >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Time Spent: 4h 40m > Remaining Estimate: 0h > > This is an improvement on HIVE-24314, in which markCleaned() is called only > if +any+ files are deleted by the cleaner. This could cause a problem in the > following case: > Say for table_1 compaction1 cleaning was blocked by an open txn, and > compaction is run again on the same table (compaction2). Both compaction1 and > compaction2 could be in "ready for cleaning" at the same time. By this time > the blocking open txn could be committed. When the cleaner runs, one of > compaction1 and compaction2 will remain in the "ready for cleaning" state: > Say compaction2 is picked up by the cleaner first. The Cleaner deletes all > obsolete files. Then compaction1 is picked up by the cleaner; the cleaner > doesn't remove any files and compaction1 will stay in the queue in a "ready > for cleaning" state. > HIVE-24291 already solves this issue but if it isn't usable (for example if > HMS schema changes are out of the question) then HIVE-24314 + this change will > fix the issue of the Cleaner not removing all obsolete files. -- This message was sent by Atlassian Jira
[jira] [Work logged] (HIVE-24432) Delete Notification Events in Batches
[ https://issues.apache.org/jira/browse/HIVE-24432?focusedWorklogId=519120=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519120 ] ASF GitHub Bot logged work on HIVE-24432: - Author: ASF GitHub Bot Created on: 02/Dec/20 16:59 Start Date: 02/Dec/20 16:59 Worklog Time Spent: 10m Work Description: belugabehr opened a new pull request #1710: URL: https://github.com/apache/hive/pull/1710 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519120) Time Spent: 0.5h (was: 20m) > Delete Notification Events in Batches > - > > Key: HIVE-24432 > URL: https://issues.apache.org/jira/browse/HIVE-24432 > Project: Hive > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Notification events are loaded in batches (reduces memory pressure on the > HMS), but all of the deletes happen under a single transactions and, when > deleting many records, can put a lot of pressure on the backend database. > Instead, delete events in batches (in different transactions) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24432) Delete Notification Events in Batches
[ https://issues.apache.org/jira/browse/HIVE-24432?focusedWorklogId=519117=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519117 ] ASF GitHub Bot logged work on HIVE-24432: - Author: ASF GitHub Bot Created on: 02/Dec/20 16:58 Start Date: 02/Dec/20 16:58 Worklog Time Spent: 10m Work Description: belugabehr closed pull request #1710: URL: https://github.com/apache/hive/pull/1710 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519117) Time Spent: 20m (was: 10m) > Delete Notification Events in Batches > - > > Key: HIVE-24432 > URL: https://issues.apache.org/jira/browse/HIVE-24432 > Project: Hive > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Notification events are loaded in batches (reduces memory pressure on the > HMS), but all of the deletes happen under a single transactions and, when > deleting many records, can put a lot of pressure on the backend database. > Instead, delete events in batches (in different transactions) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24432) Delete Notification Events in Batches
[ https://issues.apache.org/jira/browse/HIVE-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242511#comment-17242511 ] David Mollitor commented on HIVE-24432: --- [~anishek] Yes. I thought about implementing it this way; however, it's not always that simple. Since Hive is using an ORM, it can sometimes have a negative effect when doing modifications to the DB directly, since the ORM caches fall out of sync with the state of the DB after that modification. I think in this case it might be OK, but there is less risk doing it in this manner, and the cleanup isn't too important in terms of performance. > Delete Notification Events in Batches > - > > Key: HIVE-24432 > URL: https://issues.apache.org/jira/browse/HIVE-24432 > Project: Hive > Issue Type: Improvement >Affects Versions: 3.2.0 >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Notification events are loaded in batches (reduces memory pressure on the > HMS), but all of the deletes happen under a single transaction and, when > deleting many records, can put a lot of pressure on the backend database. > Instead, delete events in batches (in different transactions) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24436) Fix Avro NULL_DEFAULT_VALUE compatibility issue
[ https://issues.apache.org/jira/browse/HIVE-24436?focusedWorklogId=519091&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519091 ] ASF GitHub Bot logged work on HIVE-24436: - Author: ASF GitHub Bot Created on: 02/Dec/20 16:18 Start Date: 02/Dec/20 16:18 Worklog Time Spent: 10m Work Description: iemejia commented on pull request #1722: URL: https://github.com/apache/hive/pull/1722#issuecomment-737336194 Excellent! For info, the vote for Avro 1.10.1 (that fixes the null default issue) is almost over and artifacts should be published tomorrow or the day after. I will update my PR once it is out. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519091) Time Spent: 2.5h (was: 2h 20m) > Fix Avro NULL_DEFAULT_VALUE compatibility issue > --- > > Key: HIVE-24436 > URL: https://issues.apache.org/jira/browse/HIVE-24436 > Project: Hive > Issue Type: Improvement > Components: Avro >Affects Versions: 2.3.8 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 2.3.8, 3.1.3, 4.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > Exception1: > {noformat} > - create hive serde table with Catalog > *** RUN ABORTED *** > java.lang.NoSuchMethodError: 'void > org.apache.avro.Schema$Field.<init>(java.lang.String, org.apache.avro.Schema, > java.lang.String, org.codehaus.jackson.JsonNode)' > at > org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.createAvroField(TypeInfoToSchema.java:76) > at > org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.convert(TypeInfoToSchema.java:61) > at > org.apache.hadoop.hive.serde2.avro.AvroSerDe.getSchemaFromCols(AvroSerDe.java:170) > at > org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:114) > at > 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83) > at > org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450) > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437) > at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281) > at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263) > {noformat} > Exception2: > {noformat} > - alter hive serde table add columns -- partitioned - AVRO *** FAILED *** > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: > org.apache.avro.AvroRuntimeException: Unknown datum class: class > org.codehaus.jackson.node.NullNode; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:346) > at > org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:166) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3680) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24468) Use Event Time instead of Current Time in Notification Log DB Entry
[ https://issues.apache.org/jira/browse/HIVE-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242465#comment-17242465 ] Anishek Agarwal commented on HIVE-24468: +1 > Use Event Time instead of Current Time in Notification Log DB Entry > --- > > Key: HIVE-24468 > URL: https://issues.apache.org/jira/browse/HIVE-24468 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24453) Direct SQL error when parsing create_time value for database
[ https://issues.apache.org/jira/browse/HIVE-24453?focusedWorklogId=519085=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519085 ] ASF GitHub Bot logged work on HIVE-24453: - Author: ASF GitHub Bot Created on: 02/Dec/20 15:56 Start Date: 02/Dec/20 15:56 Worklog Time Spent: 10m Work Description: jcamachor merged pull request #1719: URL: https://github.com/apache/hive/pull/1719 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519085) Time Spent: 20m (was: 10m) > Direct SQL error when parsing create_time value for database > > > Key: HIVE-24453 > URL: https://issues.apache.org/jira/browse/HIVE-24453 > Project: Hive > Issue Type: Bug > Components: Metastore >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > HIVE-21077 introduced a {{create_time}} field for {{DBS}} table in HMS. > Although the value for that field is always set after that patch, the value > could be null if the database was created before the feature went in. 
> DirectSQL should check for a null value before parsing the integer; otherwise > we hit an exception and fall back to the ORM path: > {code} > 2020-11-28 09:06:05,414 WARN org.apache.hadoop.hive.metastore.ObjectStore: > [pool-8-thread-194]: Falling back to ORM path due to direct SQL failure (this > is not an error): null at > org.apache.hadoop.hive.metastore.MetastoreDirectSqlUtils.extractSqlInt(MetastoreDirectSqlUtils.java:251) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getDatabase(MetaStoreDirectSql.java:420) > at > org.apache.hadoop.hive.metastore.ObjectStore$1.getSqlResult(ObjectStore.java:839) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
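The defensive parse described above can be sketched as follows. This is a minimal illustration, not the actual Hive patch: the class name, method name, and the choice to report a missing create_time as 0 are assumptions.

```java
// Hypothetical null-tolerant integer extraction for direct SQL results.
// CREATE_TIME may be NULL for databases created before HIVE-21077, so a
// defensive parse avoids the exception that forces the ORM fallback.
public class DirectSqlIntSketch {
    static int extractSqlInt(Object value) {
        if (value == null) {
            return 0; // assumption: report an unset create_time as 0
        }
        if (value instanceof Number) {
            return ((Number) value).intValue(); // typical JDBC result type
        }
        return Integer.parseInt(value.toString()); // e.g. a string-typed column
    }
}
```

With this shape the direct SQL path tolerates pre-HIVE-21077 rows instead of throwing and falling back to ORM.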
[jira] [Updated] (HIVE-24453) Direct SQL error when parsing create_time value for database
[ https://issues.apache.org/jira/browse/HIVE-24453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesus Camacho Rodriguez updated HIVE-24453: --- Fix Version/s: 4.0.0 Resolution: Fixed Status: Resolved (was: Patch Available) Pushed to master, thanks for the review [~kkasa]! > Direct SQL error when parsing create_time value for database > > > Key: HIVE-24453 > URL: https://issues.apache.org/jira/browse/HIVE-24453 > Project: Hive > Issue Type: Bug > Components: Metastore >Reporter: Jesus Camacho Rodriguez >Assignee: Jesus Camacho Rodriguez >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > HIVE-21077 introduced a {{create_time}} field for {{DBS}} table in HMS. > Although the value for that field is always set after that patch, the value > could be null if the database was created before the feature went in. > DirectSQL should check for null value before parsing the integer, otherwise > we hit an exception and fallback to ORM path: > {code} > 2020-11-28 09:06:05,414 WARN org.apache.hadoop.hive.metastore.ObjectStore: > [pool-8-thread-194]: Falling back to ORM path due to direct SQL failure (this > is not an error): null at > org.apache.hadoop.hive.metastore.MetastoreDirectSqlUtils.extractSqlInt(MetastoreDirectSqlUtils.java:251) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.getDatabase(MetaStoreDirectSql.java:420) > at > org.apache.hadoop.hive.metastore.ObjectStore$1.getSqlResult(ObjectStore.java:839) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24444) compactor.Cleaner should not set state "mark cleaned" if there are obsolete files in the FS
[ https://issues.apache.org/jira/browse/HIVE-24444?focusedWorklogId=519051=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519051 ] ASF GitHub Bot logged work on HIVE-24444: - Author: ASF GitHub Bot Created on: 02/Dec/20 14:50 Start Date: 02/Dec/20 14:50 Worklog Time Spent: 10m Work Description: klcopp commented on a change in pull request #1716: URL: https://github.com/apache/hive/pull/1716#discussion_r534226306
## File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
@@ -316,6 +314,30 @@ private boolean removeFiles(String location, ValidWriteIdList writeIdList, Compa
       }
       fs.delete(dead, true);
     }
-    return true;
+    // Check if there will be more obsolete directories to clean when possible. We will only mark cleaned when this
+    // number reaches 0.
+    return getNumEventuallyObsoleteDirs(location, dirSnapshots) == 0;
+  }
+
+  /**
+   * Get the number of base/delta directories the Cleaner should remove eventually. If we check this after cleaning
+   * we can see if the Cleaner has further work to do in this table/partition directory that it hasn't been able to
+   * finish, e.g. because of an open transaction at the time of compaction.
+   * We do this by assuming that there are no open transactions anywhere and then calling getAcidState. If there are
+   * obsolete directories, then the Cleaner has more work to do.
+   * @param location location of table
+   * @return number of dirs left for the cleaner to clean – eventually
+   * @throws IOException
+   */
+  private int getNumEventuallyObsoleteDirs(String location, Map dirSnapshots)
+      throws IOException {
+    ValidTxnList validTxnList = new ValidReadTxnList();
+    //save it so that getAcidState() sees it
+    conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.writeToString());
+    ValidReaderWriteIdList validWriteIdList = new ValidReaderWriteIdList();
+    Path locPath = new Path(location);
+    AcidUtils.Directory dir = AcidUtils.getAcidState(locPath.getFileSystem(conf), locPath, conf, validWriteIdList,
+        Ref.from(false), false, dirSnapshots);
+    return dir.getObsolete().size();
Review comment: Okay, I see what you mean. I removed the aborted files from the total. In general, do you think checking for obsolete files is better than checking whether we removed any files? Case 1: Assuming HIVE-23107 etc. are present in the version? Case 2: Assuming HIVE-23107 etc. are _not_ present in the version? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519051) Time Spent: 4.5h (was: 4h 20m) > compactor.Cleaner should not set state "mark cleaned" if there are obsolete > files in the FS > --- > > Key: HIVE-24444 > URL: https://issues.apache.org/jira/browse/HIVE-24444 > Project: Hive > Issue Type: Bug >Reporter: Karen Coppage >Assignee: Karen Coppage >Priority: Major > Labels: pull-request-available > Time Spent: 4.5h > Remaining Estimate: 0h > > This is an improvement on HIVE-24314, in which markCleaned() is called only > if +any+ files are deleted by the cleaner.
This could cause a problem in the > following case: > Say for table_1, compaction1's cleaning was blocked by an open txn, and > compaction is run again on the same table (compaction2). Both compaction1 and > compaction2 could be in "ready for cleaning" at the same time. By this time > the blocking open txn could be committed. When the cleaner runs, one of > compaction1 and compaction2 will remain in the "ready for cleaning" state: > Say compaction2 is picked up by the cleaner first. The Cleaner deletes all > obsolete files. Then compaction1 is picked up by the cleaner; the cleaner > doesn't remove any files and compaction1 will stay in the queue in a "ready > for cleaning" state. > HIVE-24291 already solves this issue but if it isn't usable (for example if > HMS schema changes are out of the question) then HIVE-24314 + this change will > fix the issue of the Cleaner not removing all obsolete files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
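The decision rule discussed in this thread can be modeled with a small sketch. Everything here is illustrative (the class, method, and directory names are invented, not Hive's API): delete what is obsolete now, but report "mark cleaned" only when nothing that will eventually become obsolete remains in the directory.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model of the Cleaner decision. removeFiles deletes the directories
// this cleaning run may delete, then reports success ("mark cleaned") only
// when no directory that will *eventually* become obsolete remains —
// mirroring the getNumEventuallyObsoleteDirs(...) == 0 check above.
class CleanerSketch {
    static boolean removeFiles(Set<String> dirsOnFs,
                               Set<String> obsoleteNow,
                               Set<String> eventuallyObsolete) {
        dirsOnFs.removeAll(obsoleteNow);      // delete what we can clean now
        // Re-check as if no open transactions existed anywhere:
        Set<String> remaining = new HashSet<>(dirsOnFs);
        remaining.retainAll(eventuallyObsolete);
        return remaining.isEmpty();           // mark cleaned only if 0 are left
    }
}
```

In the two-compaction scenario above, the first cleaning run that cannot yet delete every eventually-obsolete delta returns false and stays queued, instead of being marked cleaned prematurely.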
[jira] [Work logged] (HIVE-24460) Refactor Get Next Event ID for DbNotificationListener
[ https://issues.apache.org/jira/browse/HIVE-24460?focusedWorklogId=519034=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519034 ] ASF GitHub Bot logged work on HIVE-24460: - Author: ASF GitHub Bot Created on: 02/Dec/20 14:22 Start Date: 02/Dec/20 14:22 Worklog Time Spent: 10m Work Description: belugabehr commented on a change in pull request #1725: URL: https://github.com/apache/hive/pull/1725#discussion_r534203356 ## File path: hcatalog/server-extensions/src/main/java/org/apache/hive/hcatalog/listener/DbNotificationListener.java ## @@ -1217,7 +1251,7 @@ private void addNotificationLog(NotificationEvent event, ListenerEvent listenerE params.add(catName); } - s = "insert into \"NOTIFICATION_LOG\" (" + columns + ") VALUES (" + insertVal + ")"; + String s = "insert into \"NOTIFICATION_LOG\" (" + columns + ") VALUES (" + insertVal + ")"; Review comment: @miklosgergely Yes, for sure. And I can take a look at that in a future refactoring, but this request is out of scope for this change which only affects the generation of the "Next Event ID." Thanks for the review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519034) Time Spent: 0.5h (was: 20m) > Refactor Get Next Event ID for DbNotificationListener > - > > Key: HIVE-24460 > URL: https://issues.apache.org/jira/browse/HIVE-24460 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Refactor event ID generation to match notification log ID generation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24450) DbNotificationListener Request Notification IDs in Batches
[ https://issues.apache.org/jira/browse/HIVE-24450?focusedWorklogId=519033=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519033 ] ASF GitHub Bot logged work on HIVE-24450: - Author: ASF GitHub Bot Created on: 02/Dec/20 14:20 Start Date: 02/Dec/20 14:20 Worklog Time Spent: 10m Work Description: belugabehr commented on pull request #1718: URL: https://github.com/apache/hive/pull/1718#issuecomment-737258804 Thank you all for the feedback. Please review my notes: https://issues.apache.org/jira/browse/HIVE-24450 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519033) Time Spent: 50m (was: 40m) > DbNotificationListener Request Notification IDs in Batches > -- > > Key: HIVE-24450 > URL: https://issues.apache.org/jira/browse/HIVE-24450 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Every time a new notification event is logged into the database, the sequence > number for the ID of the event is incremented by one. It is very standard in > database design to instead request a block of IDs for each fetch from the > database. The sequence numbers are then handed out locally until the block > of IDs is exhausted. This allows for fewer database round-trips and > transactions, at the expense of perhaps burning a few IDs. > Burning of IDs happens when the server is restarted in the middle of a block > of sequence IDs. That is, if the HMS requests a block of 10 IDs, and only > three have been assigned, after the restart, the HMS will request another > block of 10, burning (wasting) 7 IDs. 
As long as the blocks are not too > small and restarts are infrequent, few IDs are lost. -- This message was sent by Atlassian Jira (v8.3.4#803005)
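The block-allocation scheme described above can be sketched as follows. This is a toy model under stated assumptions: `BlockIdAllocator` is not Hive code, and the `AtomicLong` stands in for the database sequence row that the HMS would update in a single round-trip.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hands out IDs from a locally cached block, touching the backing
// sequence (one "database round-trip") only when the block is exhausted.
class BlockIdAllocator {
    static final int BLOCK_SIZE = 10;
    private final AtomicLong dbSequence; // stand-in for the sequence table
    private long next;                   // next ID to hand out
    private long blockEnd;               // exclusive end of the cached block

    BlockIdAllocator(AtomicLong dbSequence) {
        this.dbSequence = dbSequence;
    }

    synchronized long nextId() {
        if (next >= blockEnd) {
            // reserve BLOCK_SIZE IDs at once instead of one per event
            next = dbSequence.getAndAdd(BLOCK_SIZE);
            blockEnd = next + BLOCK_SIZE;
        }
        return next++;
    }
}
```

Creating a fresh allocator over the same sequence models a server restart: IDs remaining in the old block are burned, and the next ID comes from a newly reserved block.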
[jira] [Work logged] (HIVE-24450) DbNotificationListener Request Notification IDs in Batches
[ https://issues.apache.org/jira/browse/HIVE-24450?focusedWorklogId=519032=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519032 ] ASF GitHub Bot logged work on HIVE-24450: - Author: ASF GitHub Bot Created on: 02/Dec/20 14:20 Start Date: 02/Dec/20 14:20 Worklog Time Spent: 10m Work Description: belugabehr closed pull request #1718: URL: https://github.com/apache/hive/pull/1718 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519032) Time Spent: 40m (was: 0.5h) > DbNotificationListener Request Notification IDs in Batches > -- > > Key: HIVE-24450 > URL: https://issues.apache.org/jira/browse/HIVE-24450 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Every time a new notification event is logged into the database, the sequence > number for the ID of the event is incremented by one. It is very standard in > database design to instead request a block of IDs for each fetch from the > database. The sequence numbers are then handed out locally until the block > of IDs is exhausted. This allows for fewer database round-trips and > transactions, at the expense of perhaps burning a few IDs. > Burning of IDs happens when the server is restarted in the middle of a block > of sequence IDs. That is, if the HMS requests a block of 10 IDs, and only > three have been assigned, after the restart, the HMS will request another > block of 10, burning (wasting) 7 IDs. As long as the blocks are not too > small and restarts are infrequent, few IDs are lost. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HIVE-24450) DbNotificationListener Request Notification IDs in Batches
[ https://issues.apache.org/jira/browse/HIVE-24450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor resolved HIVE-24450. --- Resolution: Won't Fix > DbNotificationListener Request Notification IDs in Batches > -- > > Key: HIVE-24450 > URL: https://issues.apache.org/jira/browse/HIVE-24450 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Every time a new notification event is logged into the database, the sequence > number for the ID of the event is incremented by one. It is very standard in > database design to instead request a block of IDs for each fetch from the > database. The sequence numbers are then handed out locally until the block > of IDs is exhausted. This allows for fewer database round-trips and > transactions, at the expense of perhaps burning a few IDs. > Burning of IDs happens when the server is restarted in the middle of a block > of sequence IDs. That is, if the HMS requests a block of 10 IDs, and only > three have been assigned, after the restart, the HMS will request another > block of 10, burning (wasting) 7 IDs. As long as the blocks are not too > small and restarts are infrequent, few IDs are lost. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24450) DbNotificationListener Request Notification IDs in Batches
[ https://issues.apache.org/jira/browse/HIVE-24450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242389#comment-17242389 ] David Mollitor commented on HIVE-24450: --- It would be great if you could also look at HIVE-24468. Thanks. > DbNotificationListener Request Notification IDs in Batches > -- > > Key: HIVE-24450 > URL: https://issues.apache.org/jira/browse/HIVE-24450 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Every time a new notification event is logged into the database, the sequence > number for the ID of the event is incremented by one. It is very standard in > database design to instead request a block of IDs for each fetch from the > database. The sequence numbers are then handed out locally until the block > of IDs is exhausted. This allows for fewer database round-trips and > transactions, at the expense of perhaps burning a few IDs. > Burning of IDs happens when the server is restarted in the middle of a block > of sequence IDs. That is, if the HMS requests a block of 10 IDs, and only > three have been assigned, after the restart, the HMS will request another > block of 10, burning (wasting) 7 IDs. As long as the blocks are not too > small and restarts are infrequent, few IDs are lost. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24468) Use Event Time instead of Current Time in Notification Log DB Entry
[ https://issues.apache.org/jira/browse/HIVE-24468?focusedWorklogId=519029=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519029 ] ASF GitHub Bot logged work on HIVE-24468: - Author: ASF GitHub Bot Created on: 02/Dec/20 14:18 Start Date: 02/Dec/20 14:18 Worklog Time Spent: 10m Work Description: belugabehr opened a new pull request #1728: URL: https://github.com/apache/hive/pull/1728 …g DB Entry ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519029) Remaining Estimate: 0h Time Spent: 10m > Use Event Time instead of Current Time in Notification Log DB Entry > --- > > Key: HIVE-24468 > URL: https://issues.apache.org/jira/browse/HIVE-24468 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HIVE-24468) Use Event Time instead of Current Time in Notification Log DB Entry
[ https://issues.apache.org/jira/browse/HIVE-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-24468: -- Labels: pull-request-available (was: ) > Use Event Time instead of Current Time in Notification Log DB Entry > --- > > Key: HIVE-24468 > URL: https://issues.apache.org/jira/browse/HIVE-24468 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HIVE-24468) Use Event Time instead of Current Time in Notification Log DB Entry
[ https://issues.apache.org/jira/browse/HIVE-24468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor reassigned HIVE-24468: - > Use Event Time instead of Current Time in Notification Log DB Entry > --- > > Key: HIVE-24468 > URL: https://issues.apache.org/jira/browse/HIVE-24468 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HIVE-24450) DbNotificationListener Request Notification IDs in Batches
[ https://issues.apache.org/jira/browse/HIVE-24450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242383#comment-17242383 ] David Mollitor commented on HIVE-24450: --- [~aasha] [~pvargacl] [~anishek], Thanks for the review! That is unfortunate (re: performance) but thank you for clarifying. Can you please take a look at HIVE-24463? I have added a special case for MySQL to improve performance and I have changed the code so that incrementing by 1 is hardcoded. As it is currently written, the code makes the reader believe that the counter can be incremented by an arbitrary amount. > DbNotificationListener Request Notification IDs in Batches > -- > > Key: HIVE-24450 > URL: https://issues.apache.org/jira/browse/HIVE-24450 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Every time a new notification event is logged into the database, the sequence > number for the ID of the event is incremented by one. It is very standard in > database design to instead request a block of IDs for each fetch from the > database. The sequence numbers are then handed out locally until the block > of IDs is exhausted. This allows for fewer database round-trips and > transactions, at the expense of perhaps burning a few IDs. > Burning of IDs happens when the server is restarted in the middle of a block > of sequence IDs. That is, if the HMS requests a block of 10 IDs, and only > three have been assigned, after the restart, the HMS will request another > block of 10, burning (wasting) 7 IDs. As long as the blocks are not too > small and restarts are infrequent, few IDs are lost. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-23965) Improve plan regression tests using TPCDS30TB metastore dump and custom configs
[ https://issues.apache.org/jira/browse/HIVE-23965?focusedWorklogId=519022=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-519022 ] ASF GitHub Bot logged work on HIVE-23965: - Author: ASF GitHub Bot Created on: 02/Dec/20 14:00 Start Date: 02/Dec/20 14:00 Worklog Time Spent: 10m Work Description: zabetak commented on pull request #1714: URL: https://github.com/apache/hive/pull/1714#issuecomment-737247269 Hey @kgyrtkirk can you have another look mostly on https://github.com/apache/hive/pull/1714/commits/df6e610c7f7b11b0bf06b500b25613c1a811c055 please? Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 519022) Time Spent: 5h 50m (was: 5h 40m) > Improve plan regression tests using TPCDS30TB metastore dump and custom > configs > --- > > Key: HIVE-23965 > URL: https://issues.apache.org/jira/browse/HIVE-23965 > Project: Hive > Issue Type: Improvement >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: master355.tgz > > Time Spent: 5h 50m > Remaining Estimate: 0h > > The existing regression tests (HIVE-12586) based on TPC-DS have certain > shortcomings: > The table statistics do not reflect cardinalities from a specific TPC-DS > scale factor (SF). Some tables are from a 30TB dataset, others from a 200GB > dataset, and others from a 3GB dataset. This mix leads to plans that may > never appear when using an actual TPC-DS dataset. > The existing statistics do not contain information about partitions, something > that can have a big impact on the resulting plans. > The existing regression tests rely more or less on the default > configuration (hive-site.xml). 
In real-life scenarios, though, some of the > configurations differ and may impact the choices of the optimizer. > This issue aims to address the above shortcomings by using a curated > TPCDS30TB metastore dump along with some custom Hive configurations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HIVE-21919) Refactor Driver
[ https://issues.apache.org/jira/browse/HIVE-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miklos Gergely resolved HIVE-21919. --- Resolution: Fixed > Refactor Driver > --- > > Key: HIVE-21919 > URL: https://issues.apache.org/jira/browse/HIVE-21919 > Project: Hive > Issue Type: Improvement > Components: Hive >Affects Versions: 3.1.1 >Reporter: Miklos Gergely >Assignee: Miklos Gergely >Priority: Major > Labels: refactor-driver > Fix For: 4.0.0 > > > The Driver class is 3000+ lines long. It does a lot of things, and its structure > is hard to follow. It needs to be put into a cleaner form to make it more > readable. It should be cut into many pieces, with separate classes for > different subtasks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HIVE-24333) Cut long methods in Driver to smaller, more manageable pieces
[ https://issues.apache.org/jira/browse/HIVE-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miklos Gergely resolved HIVE-24333. --- Resolution: Fixed Merged to master, thank you [~belugabehr] > Cut long methods in Driver to smaller, more manageable pieces > - > > Key: HIVE-24333 > URL: https://issues.apache.org/jira/browse/HIVE-24333 > Project: Hive > Issue Type: Sub-task > Components: Hive >Reporter: Miklos Gergely >Assignee: Miklos Gergely >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > Some methods in Driver are too long to be easily understandable. They should > be cut into pieces to make them easier to understand. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24333) Cut long methods in Driver to smaller, more manageable pieces
[ https://issues.apache.org/jira/browse/HIVE-24333?focusedWorklogId=518984=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518984 ] ASF GitHub Bot logged work on HIVE-24333: - Author: ASF GitHub Bot Created on: 02/Dec/20 12:51 Start Date: 02/Dec/20 12:51 Worklog Time Spent: 10m Work Description: miklosgergely merged pull request #1629: URL: https://github.com/apache/hive/pull/1629 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 518984) Time Spent: 2h 50m (was: 2h 40m) > Cut long methods in Driver to smaller, more manageable pieces > - > > Key: HIVE-24333 > URL: https://issues.apache.org/jira/browse/HIVE-24333 > Project: Hive > Issue Type: Sub-task > Components: Hive >Reporter: Miklos Gergely >Assignee: Miklos Gergely >Priority: Major > Labels: pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > Some methods in Driver are too long to be easily understandable. They should > be cut into pieces to make them easier to understand. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24450) DbNotificationListener Request Notification IDs in Batches
[ https://issues.apache.org/jira/browse/HIVE-24450?focusedWorklogId=518891=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-518891 ] ASF GitHub Bot logged work on HIVE-24450: - Author: ASF GitHub Bot Created on: 02/Dec/20 09:19 Start Date: 02/Dec/20 09:19 Worklog Time Spent: 10m Work Description: aasha commented on pull request #1718: URL: https://github.com/apache/hive/pull/1718#issuecomment-737101280 In the HA case, how will the ordering of events be maintained? Acid Replication relies on the event sequence, so the ordering needs to be maintained. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 518891) Time Spent: 0.5h (was: 20m) > DbNotificationListener Request Notification IDs in Batches > -- > > Key: HIVE-24450 > URL: https://issues.apache.org/jira/browse/HIVE-24450 > Project: Hive > Issue Type: Improvement >Reporter: David Mollitor >Assignee: David Mollitor >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Every time a new notification event is logged into the database, the sequence > number for the ID of the event is incremented by one. It is very standard in > database design to instead request a block of IDs for each fetch from the > database. The sequence numbers are then handed out locally until the block > of IDs is exhausted. This allows for fewer database round-trips and > transactions, at the expense of perhaps burning a few IDs. > Burning of IDs happens when the server is restarted in the middle of a block > of sequence IDs. That is, if the HMS requests a block of 10 IDs, and only > three have been assigned, after the restart, the HMS will request another > block of 10, burning (wasting) 7 IDs. 
As long as the blocks are not too > small and restarts are infrequent, few IDs are lost. -- This message was sent by Atlassian Jira (v8.3.4#803005)