[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Vihang Karajgaonkar has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. IMPALA-4364: Query option to refresh updated partitions This patch introduces a new boolean query option REFRESH_UPDATED_HMS_PARTITIONS. When this query option is set the refresh table command reloads the partitions which have been modified in HMS in addition to adding [removing] the new [removed] partitions. In order to do this the refresh table command needs to fetch all the partitions instead of the just the partition names which can cause the performance of refresh table to degrade when the query option is set. However for certain use-cases currently there is no way to detect changed partitions using refresh table command. For instance, if certain partition locations have been changed, a refresh table will not update those partitions. Testing: 1. Added a new test which sets the query option and makes sure that the updated partitions from hive are reloaded after refresh table command. 2. Ran exhaustive tests with the patch. Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af --- M be/src/service/client-request-state.cc M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/CatalogService.thrift M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/Frontend.java M tests/metadata/test_reset_metadata.py 12 files changed, 461 insertions(+), 107 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16308/4 -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 4 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vihang Karajgaonkar
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. Patch Set 2: > (1 comment) > > I think this makes sense at a high level. > > Are you thinking that we'd want to optimise this further before > changing the default? There is not much we can do here other than fetching only partially filled partitions using the API introduced in HIVE-19715. For partitions which are found to be updated, we will need to do a full partition fetch. Other than that I think we should profile this on a large sized table to identify how much of a hit this causes over the regular refresh table command. -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 2 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Comment-Date: Mon, 10 Aug 2020 21:10:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6860/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 2 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vihang Karajgaonkar Gerrit-Comment-Date: Mon, 10 Aug 2020 20:24:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Vihang Karajgaonkar has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. IMPALA-4364: Query option to refresh updated partitions This patch introduces a new boolean query option ENABLE_REFRESH_UPDATED_PARTITIONS. When this query option is set the refresh table command reloads the partitions which have been modified in HMS in addition to adding [removing] the new [removed] partitions. In order to do this the refresh table command needs to fetch all the partitions instead of the just the partition names which can cause the performance of refresh table to degrade when the query option is set. However, for certain use-cases there is no way currently the refresh table doesn't detect changed partitions. For instance, if certain partition locations have been changed, a refresh table will not update those partitions. Testing: 1. Added a new test which sets the query option and makes sure that the updated partitions from hive are reloaded after refresh table command. 2. Ran exhaustive tests with the patch. Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af --- M be/src/service/client-request-state.cc M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/CatalogService.thrift M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java M fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java M fe/src/main/java/org/apache/impala/catalog/HBaseTable.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java M fe/src/main/java/org/apache/impala/catalog/IncompleteTable.java M fe/src/main/java/org/apache/impala/catalog/KuduTable.java M fe/src/main/java/org/apache/impala/catalog/Table.java M fe/src/main/java/org/apache/impala/catalog/TableLoader.java M fe/src/main/java/org/apache/impala/catalog/View.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/Frontend.java M tests/metadata/test_reset_metadata.py 20 files changed, 465 insertions(+), 121 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16308/2 -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 2 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Vihang Karajgaonkar
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. Patch Set 1: (1 comment) I think this makes sense at a high level. Are you thinking that we'd want to optimise this further before changing the default? http://gerrit.cloudera.org:8080/#/c/16308/1/common/thrift/CatalogService.thrift File common/thrift/CatalogService.thrift: http://gerrit.cloudera.org:8080/#/c/16308/1/common/thrift/CatalogService.thrift@255 PS1, Line 255: refresh_modified_partitions nit: name here doesn't match the query option name -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 1 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 10 Aug 2020 18:09:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. Patch Set 1: (15 comments) Have a first round look. It's nice to refactor the REFRESH code path! The solution looks good to me. Will have a more detailed look. http://gerrit.cloudera.org:8080/#/c/16308/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16308/1//COMMIT_MSG@10 PS1, Line 10: ENABLE_REFRESH_UPDATED_PARTITIONS I think we still refresh the file metadata of some kinds of updated partitions without this flag. E.g. partitions updated by INSERTs that just change the file meta. The key part is about refreshing hms metadata. Can we make its name more specifit, e.g. FORCE_REFRESH_HMS_META? http://gerrit.cloudera.org:8080/#/c/16308/1/be/src/service/client-request-state.cc File be/src/service/client-request-state.cc: http://gerrit.cloudera.org:8080/#/c/16308/1/be/src/service/client-request-state.cc@a630 PS1, Line 630: nit: don't need this? http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@911 PS1, Line 911: the callers do not need to delete the incremental stats : // strings from the hmsParams_ map later. I think this comment is stale after we change hmsParameters_ to an ImmutableMap. I think the reason is that hmsParameters_ is immutable. Make a copy so that the callers can modify the parameters. http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1270 PS1, Line 1270: PartitionDeltaChecker deltaChecker = Nice abstraction! Good to see that we refactor these codes. http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1275 PS1, Line 1275: deltaChecker.apply(); nit: "Checker" sounds like it won't modify anything. What about changing its name to something like PartitionUpdater? http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1327 PS1, Line 1327: * @param partition : * @return : * @throws Exception nit: remove these if no comments http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1334 PS1, Line 1334: public abstract long loadUpdatedPartitions( Could you also add a method comment for this? I think it means update(reload) both of the hms and file metadata of the given partition builders. http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1373 PS1, Line 1373: loadPartitionsFromMetastore() nit: stale comment, should be loadNewPartitions(). http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1406 PS1, Line 1406: HDFSTable nit: HdfsTable http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1419 PS1, Line 1419: List partitionList = : MetaStoreUtil.fetchAllPartitions( : client_, db_.getName(), name_, NUM_PARTITION_FETCH_RETRIES); This may be time-consuming. If partitionsToUpdate != null, can we just load msPartitions only for given partitions? http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1451 PS1, Line 1451: we compare the StorageDescriptor which nit: unfinished comment? http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1453 PS1, Line 1453: if(!msPartition.getSd().equals(sd)) { Can we add a test for this to verify that unchanged partitions won't go into this branch? We depends on the implementation of StorageDescriptor#equals(). One day if it changes and breaks our understandings, we can notice it in time. http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/service/Frontend.java File fe/src/main/java/org/apache/impala/service/Frontend.java: http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/service/Frontend.java@581 PS1, Line 581: LOG.info("VIHANG-DEBUG: Creating TResetMetadataRequest"); I think you forgot to remove this and line 1561, 1615. http://gerrit.cloudera.org:8080/#/c/16308/1/tests/metadata/test_reset_metadata.py File tests/metadata/test_reset_metadata.py: http://gerrit.cloudera.org:8080/#/c/16308/1/tests/metadata/test_reset_metadata.py@54 PS1, Line 54: result Could you comment
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6840/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 1 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 07 Aug 2020 21:55:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16308 ) Change subject: IMPALA-4364: Query option to refresh updated partitions .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1326 PS1, Line 1326: * the given partition. line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1388 PS1, Line 1388: loadTimeForFileMdNs_ += loadFileMetadataForPartitions(client_, partitionsToLoadFiles, line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1483 PS1, Line 1483: public long loadUpdatedPartitions(Map updatedPartBuilders) line too long (93 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 1 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 07 Aug 2020 21:32:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions
Vihang Karajgaonkar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16308 Change subject: IMPALA-4364: Query option to refresh updated partitions .. IMPALA-4364: Query option to refresh updated partitions This patch introduces a new boolean query option ENABLE_REFRESH_UPDATED_PARTITIONS. When this query option is set the refresh table command reloads the partitions which have been modified in HMS in addition to adding [removing] the new [removed] partitions. In order to do this the refresh table command needs to fetch all the partitions instead of the just the partition names which can cause the performance of refresh table to degrade when the query option is set. However, for certain use-cases there is no way currently the refresh table doesn't detect changed partitions. For instance, if certain partition locations have been changed, a refresh table will not update those partitions. Testing: 1. Added a new test which sets the query option and makes sure that the updated partitions from hive are reloaded after refresh table command. 2. [In-progress] Running exhaustive tests with the patch. Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af --- M be/src/service/client-request-state.cc M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/CatalogService.thrift M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java M fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java M fe/src/main/java/org/apache/impala/catalog/HBaseTable.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java M fe/src/main/java/org/apache/impala/catalog/IncompleteTable.java M fe/src/main/java/org/apache/impala/catalog/KuduTable.java M fe/src/main/java/org/apache/impala/catalog/Table.java M fe/src/main/java/org/apache/impala/catalog/TableLoader.java M fe/src/main/java/org/apache/impala/catalog/View.java M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java M fe/src/main/java/org/apache/impala/service/Frontend.java M tests/metadata/test_reset_metadata.py 20 files changed, 443 insertions(+), 125 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16308/1 -- To view, visit http://gerrit.cloudera.org:8080/16308 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af Gerrit-Change-Number: 16308 Gerrit-PatchSet: 1 Gerrit-Owner: Vihang Karajgaonkar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong