[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-14 Thread Vihang Karajgaonkar (Code Review)
Vihang Karajgaonkar has uploaded a new patch set (#4). ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..

IMPALA-4364: Query option to refresh updated partitions

This patch introduces a new boolean query option
REFRESH_UPDATED_HMS_PARTITIONS. When this query option is set
the refresh table command reloads the partitions which have been
modified in HMS in addition to adding [removing] the new [removed]
partitions.

In order to do this the refresh table command needs to fetch all
the partitions instead of the just the partition names which can
cause the performance of refresh table to degrade when the query
option is set. However for certain use-cases currently there is
no way to detect changed partitions using refresh table command.
For instance, if certain partition locations have been changed,
a refresh table will not update those partitions.

Testing:
1. Added a new test which sets the query option and makes sure
that the updated partitions from hive are reloaded after refresh
table command.
2. Ran exhaustive tests with the patch.

Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
---
M be/src/service/client-request-state.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/CatalogService.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M tests/metadata/test_reset_metadata.py
12 files changed, 461 insertions(+), 107 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16308/4
--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 4
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vihang Karajgaonkar 


[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-10 Thread Vihang Karajgaonkar (Code Review)
Vihang Karajgaonkar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..


Patch Set 2:

> (1 comment)
 >
 > I think this makes sense at a high level.
 >
 > Are you thinking that we'd want to optimise this further before
 > changing the default?

There is not much we can do here other than fetching only partially filled 
partitions using the API introduced in HIVE-19715. For partitions which are 
found to be updated, we will need to do a full partition fetch. Other than that 
I think we should profile this on a large sized table to identify how much of a 
hit this causes over the regular refresh table command.


--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vihang Karajgaonkar 
Gerrit-Comment-Date: Mon, 10 Aug 2020 21:10:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6860/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vihang Karajgaonkar 
Gerrit-Comment-Date: Mon, 10 Aug 2020 20:24:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-10 Thread Vihang Karajgaonkar (Code Review)
Vihang Karajgaonkar has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..

IMPALA-4364: Query option to refresh updated partitions

This patch introduces a new boolean query option
ENABLE_REFRESH_UPDATED_PARTITIONS. When this query option is set
the refresh table command reloads the partitions which have been
modified in HMS in addition to adding [removing] the new [removed]
partitions.

In order to do this the refresh table command needs to fetch all
the partitions instead of the just the partition names which can
cause the performance of refresh table to degrade when the query
option is set. However, for certain use-cases there is no way
currently the refresh table doesn't detect changed partitions.
For instance, if certain partition locations have been changed,
a refresh table will not update those partitions.

Testing:
1. Added a new test which sets the query option and makes sure
that the updated partitions from hive are reloaded after refresh
table command.
2. Ran exhaustive tests with the patch.

Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
---
M be/src/service/client-request-state.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/CatalogService.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java
M fe/src/main/java/org/apache/impala/catalog/HBaseTable.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/IncompleteTable.java
M fe/src/main/java/org/apache/impala/catalog/KuduTable.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/catalog/TableLoader.java
M fe/src/main/java/org/apache/impala/catalog/View.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M tests/metadata/test_reset_metadata.py
20 files changed, 465 insertions(+), 121 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16308/2
--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Vihang Karajgaonkar 


[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..


Patch Set 1:

(1 comment)

I think this makes sense at a high level.

Are you thinking that we'd want to optimise this further before changing the 
default?

http://gerrit.cloudera.org:8080/#/c/16308/1/common/thrift/CatalogService.thrift
File common/thrift/CatalogService.thrift:

http://gerrit.cloudera.org:8080/#/c/16308/1/common/thrift/CatalogService.thrift@255
PS1, Line 255: refresh_modified_partitions
nit: name here doesn't match the query option name



--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 1
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 10 Aug 2020 18:09:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-09 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..


Patch Set 1:

(15 comments)

Have a first round look. It's nice to refactor the REFRESH code path! The 
solution looks good to me. Will have a more detailed look.

http://gerrit.cloudera.org:8080/#/c/16308/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16308/1//COMMIT_MSG@10
PS1, Line 10: ENABLE_REFRESH_UPDATED_PARTITIONS
I think we still refresh the file metadata of some kinds of updated partitions 
without this flag. E.g. partitions updated by INSERTs that just change the file 
meta. The key part is about refreshing hms metadata. Can we make its name more 
specifit, e.g. FORCE_REFRESH_HMS_META?


http://gerrit.cloudera.org:8080/#/c/16308/1/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

http://gerrit.cloudera.org:8080/#/c/16308/1/be/src/service/client-request-state.cc@a630
PS1, Line 630:
nit: don't need this?


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@911
PS1, Line 911: the callers do not need to delete the incremental stats
 : // strings from the hmsParams_ map later.
I think this comment is stale after we change hmsParameters_ to an 
ImmutableMap. I think the reason is that hmsParameters_ is immutable. Make a 
copy so that the callers can modify the parameters.


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1270
PS1, Line 1270: PartitionDeltaChecker deltaChecker =
Nice abstraction! Good to see that we refactor these codes.


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1275
PS1, Line 1275: deltaChecker.apply();
nit: "Checker" sounds like it won't modify anything. What about changing its 
name to something like PartitionUpdater?


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1327
PS1, Line 1327:  * @param partition
  :  * @return
  :  * @throws Exception
nit: remove these if no comments


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1334
PS1, Line 1334: public abstract long loadUpdatedPartitions(
Could you also add a method comment for this? I think it means update(reload) 
both of the hms and file metadata of the given partition builders.


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1373
PS1, Line 1373: loadPartitionsFromMetastore()
nit: stale comment, should be loadNewPartitions().


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1406
PS1, Line 1406: HDFSTable
nit: HdfsTable


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1419
PS1, Line 1419:   List partitionList =
  :   MetaStoreUtil.fetchAllPartitions(
  :   client_, db_.getName(), name_, 
NUM_PARTITION_FETCH_RETRIES);
This may be time-consuming. If partitionsToUpdate != null, can we just load 
msPartitions only for given partitions?


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1451
PS1, Line 1451: we compare the StorageDescriptor which
nit: unfinished comment?


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1453
PS1, Line 1453: if(!msPartition.getSd().equals(sd)) {
Can we add a test for this to verify that unchanged partitions won't go into 
this branch? We depends on the implementation of StorageDescriptor#equals(). 
One day if it changes and breaks our understandings, we can notice it in time.


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/service/Frontend.java
File fe/src/main/java/org/apache/impala/service/Frontend.java:

http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/service/Frontend.java@581
PS1, Line 581:   LOG.info("VIHANG-DEBUG: Creating TResetMetadataRequest");
I think you forgot to remove this and line 1561, 1615.


http://gerrit.cloudera.org:8080/#/c/16308/1/tests/metadata/test_reset_metadata.py
File tests/metadata/test_reset_metadata.py:

http://gerrit.cloudera.org:8080/#/c/16308/1/tests/metadata/test_reset_metadata.py@54
PS1, Line 54: result
Could you comment 

[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6840/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 1
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 07 Aug 2020 21:55:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16308 )

Change subject: IMPALA-4364: Query option to refresh updated partitions
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1326
PS1, Line 1326:  * the given partition.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1388
PS1, Line 1388: loadTimeForFileMdNs_ += 
loadFileMetadataForPartitions(client_, partitionsToLoadFiles,
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/16308/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@1483
PS1, Line 1483: public long loadUpdatedPartitions(Map updatedPartBuilders)
line too long (93 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 1
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 07 Aug 2020 21:32:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4364: Query option to refresh updated partitions

2020-08-07 Thread Vihang Karajgaonkar (Code Review)
Vihang Karajgaonkar has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16308


Change subject: IMPALA-4364: Query option to refresh updated partitions
..

IMPALA-4364: Query option to refresh updated partitions

This patch introduces a new boolean query option
ENABLE_REFRESH_UPDATED_PARTITIONS. When this query option is set
the refresh table command reloads the partitions which have been
modified in HMS in addition to adding [removing] the new [removed]
partitions.

In order to do this the refresh table command needs to fetch all
the partitions instead of the just the partition names which can
cause the performance of refresh table to degrade when the query
option is set. However, for certain use-cases there is no way
currently the refresh table doesn't detect changed partitions.
For instance, if certain partition locations have been changed,
a refresh table will not update those partitions.

Testing:
1. Added a new test which sets the query option and makes sure
that the updated partitions from hive are reloaded after refresh
table command.
2. [In-progress] Running exhaustive tests with the patch.

Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
---
M be/src/service/client-request-state.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/CatalogService.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java
M fe/src/main/java/org/apache/impala/catalog/HBaseTable.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/IcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/IncompleteTable.java
M fe/src/main/java/org/apache/impala/catalog/KuduTable.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
M fe/src/main/java/org/apache/impala/catalog/TableLoader.java
M fe/src/main/java/org/apache/impala/catalog/View.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M tests/metadata/test_reset_metadata.py
20 files changed, 443 insertions(+), 125 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16308/1
--
To view, visit http://gerrit.cloudera.org:8080/16308
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I50e8680509f4eb0712e7bb3de44df5f2952179af
Gerrit-Change-Number: 16308
Gerrit-PatchSet: 1
Gerrit-Owner: Vihang Karajgaonkar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong