[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Reviewed-on: http://gerrit.cloudera.org:8080/16082 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M common/thrift/CatalogObjects.thrift M common/thrift/CatalogService.thrift M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/custom_cluster/test_local_catalog.py M tests/query_test/test_acid.py 27 files changed, 1,710 insertions(+), 151 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 15 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 14: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 14 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 14 Jul 2020 12:53:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 14: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6127/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 14 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 14 Jul 2020 07:49:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 14: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 14 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 14 Jul 2020 07:49:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 13: Code-Review+2 (1 comment) Thanks for adding bunch of tests! LGTM. Let's ship it! http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1541 PS11, Line 1541: /*isPartitionKeySc > I think you meant 'false', and you are right. And we also need to set it to Oops, yes, I mean 'false'... Thanks for addressing this. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 14 Jul 2020 00:20:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6576/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jul 2020 18:58:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6577/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jul 2020 18:57:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#13). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M common/thrift/CatalogObjects.thrift M common/thrift/CatalogService.thrift M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/custom_cluster/test_local_catalog.py M tests/query_test/test_acid.py 27 files changed, 1,710 insertions(+), 151 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/13 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py File tests/custom_cluster/test_local_catalog.py: http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py@30 PS12, Line 30: from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS, > flake8: F401 'tests.common.skip.SkipIfCatalogV2' imported but unused Done -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jul 2020 18:31:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 12: (9 comments) http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1368 PS11, Line 1368: // Validate all the partition key/values to ensure you can convert them toThrift() : Expr.treesToThrift(getPartitionValues()); : } catch (Exception e) { : throw new CatalogException("Partition (" + getPartitionName() + : ") has invalid partition column values: ", e); : } : } : } : : /** :* Comparator to allow ordering of partitions by their partition-key values. :*/ : public static final KeyValueComparator KV_COMPARATOR = new KeyValueComparator(); : public static class KeyValueComparator implements Comparator { : @Override : public int compare(FeFsPartition o1, FeFsPartition o2) { : return comparePartitionKeyValues(o1.getPartitionValues(), o2.getPartitionValues()); : } : } : : @VisibleForTesting : public static int comparePartitionKeyValues(List lhs, : List rhs) { : > nit: Could you move these non-static methods to somewhere above the HdfsPar Done http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453 PS11, Line 1453: "that have deleted rows and complex types. As a workaround you can run a " + : "major compaction."; > Maybe mention that the query references complex types? Done. Full-ACID complex types are coming soon, but it was low effort to add some tests for it so I added. http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1468 PS11, Line 1468: throws AnalysisException { > unused arguement Done http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1471 PS11, Line 1471: hdfsTblRef.getUniqueAlias() > Looks like if the tableRef doesn't have an explict alias, this is the full It works well, added some tests. http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1537 PS11, Line 1537: /*isPartitionKeySc > Could you add some tests for isPartitionKeyScan=true? IIUC, we only scan on Thanks for pointing this out. The code really had a bug, though I needed a bit different statements to hit it. When 'isPartitionKeyScan=true' HdfsScanNode creates only one scan range for each file descriptor. Then the backend scanners only return a single row from each scan range. So to hit the bug I had to issue the following statements: create table my_acid (id int) partitioned by (p int) stored as orc tblproperties('transactional'='true'); insert into my_acid partition(p=0) values (0), (1), (2); insert into my_acid partition(p=1) values (0), (1), (2); delete from my_acid where p = 1 and id = 0; // Run this in Impala. The result should be 1 since partition(p=1) is NOT empty. select max(p) from my_acid; In p=1 we have only one insert delta file which contains (0), (1), and (2), all of them fit into one scan range. So the scanner only returns the first element (0) that we have deleted. Added this example as e2e test. http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1541 PS11, Line 1541: /*isPartitionKeySc > Should we always set this to true to scan all delete rows? I think you meant 'false', and you are right. And we also need to set it to false for the insert delta files as well. http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@714 PS11, Line 714: > Could you add some tests to make sure that the query hints are respected? A Done
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py File tests/custom_cluster/test_local_catalog.py: http://gerrit.cloudera.org:8080/#/c/16082/12/tests/custom_cluster/test_local_catalog.py@30 PS12, Line 30: from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS, flake8: F401 'tests.common.skip.SkipIfCatalogV2' imported but unused -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jul 2020 18:29:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#12). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M common/thrift/CatalogObjects.thrift M common/thrift/CatalogService.thrift M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/custom_cluster/test_local_catalog.py M tests/query_test/test_acid.py 27 files changed, 1,710 insertions(+), 151 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/12 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 11: (9 comments) http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@1368 PS11, Line 1368: : @Override : public long getWriteId() { : return writeId_; : } : : @Override : public HdfsPartition genInsertDeltaPartition() { : ImmutableList fileDescriptors = !encodedInsertFileDescriptors_.isEmpty() ? : encodedInsertFileDescriptors_ : encodedFileDescriptors_; : return new HdfsPartition.Builder(this) : .setId(id_) : .setFileDescriptors(fileDescriptors) : .build(); : } : : @Override : public HdfsPartition genDeleteDeltaPartition() { : if (encodedDeleteFileDescriptors_.isEmpty()) return null; : return new HdfsPartition.Builder(this) : .setId(id_) : .setFileDescriptors(encodedDeleteFileDescriptors_) : .build(); : } nit: Could you move these non-static methods to somewhere above the HdfsPartition.Builder? Other codes here are static subclass, fields or methods. http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453 PS11, Line 1453: final String NOT_SUPPORTED_YET = "This query is not supported on full ACID tables " + : "that have deleted rows. As a workaround you can run a major compaction."; Maybe mention that the query references complex types? Not sure how complex it is to add a test for this. It may worths some test coverage on it. Or maybe we'll support reading full-ACID complex types soon? http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1468 PS11, Line 1468: List partitions unused arguement http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1471 PS11, Line 1471: hdfsTblRef.getUniqueAlias() Looks like if the tableRef doesn't have an explict alias, this is the full qualified table name. Could you add some tests to make sure the resolution works as expected? I.e. tests that the full-acid table is used more than once and all occurances don't have explicit alias, e.g. select count(*) from ( select * from functional_orc_def.alltypes_deleted_rows where id % 2 = 0 union all select * from functional_orc_def.alltypes_deleted_rows where id % 2 != 0 ) t; select id from functional_orc_def.alltypes_deleted_rows where id % 2 = 0 limit 10 union all select max(id) from functional_orc_def.alltypes_deleted_rows; http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1537 PS11, Line 1537: isPartitionKeyScan Could you add some tests for isPartitionKeyScan=true? IIUC, we only scan one row per partition when isPartitionKeyScan=true. If the first row from the insert delta doesn't match the first row from the delete delta, we will consider this partition as non-empty, no matter whether there are other delete rows that could match the first row from the insert delta. I think it's worth to add a test like this: create table my_acid (id int) partitioned by (p int) stored as orc tblproperties('transactional'='true'); insert into my_acid partition(p=0) values (0), (1), (2); // (p=1, id=0) may be the first row of insert delta insert into my_acid partition(p=1) values (0), (1), (2); // (p=1, id=2) may be the first row of delete delta delete from my_acid where p = 1 and id = 2; delete from my_acid where p = 1 and id = 1; delete from my_acid where p = 1 and id = 0; // Run this in Impala. The result should be 0 since partition(p=1) is empty. select max(p) from my_acid; http://gerrit.cloudera.org:8080/#/c/16082/11/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1541 PS11, Line 1541: isPartitionKeyScan Should we always set this to true to scan all delete rows? http://gerrit.cloudera.org:8080/#/c/16082/11/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6513/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 11 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jul 2020 15:29:09 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 11: (12 comments) http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java File fe/src/main/java/org/apache/impala/catalog/FeFsTable.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@376 PS10, Line 376: > Could you add a TODO in this method for IMPALA-9883? Done http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2010 PS10, Line 2010: getNumFileDescriptors(); > Please change this to getNumFileDescriptors() after we revise its implement Done http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java File fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@56 PS10, Line 56: import org.apache.thrift.TException; > nit: unused import Thanks for pointing these out! Done. http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@255 PS10, Line 255: } > nit: 4 spaces indent Done http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@409 PS10, Line 409: public ImmutableList getFileDescriptors() { : if (insertFds_.isEmpty()) return fds_; : List ret = Lists.newArrayListWithCapacity( : insertFds_.size() + deleteFds_.size()); : ret.addAll(insertFds_); : ret.addAll(deleteFds_); : return ImmutableList.copyOf(ret); : } : : @Override : public ImmutableList getInsertFileDescriptors() { : return insertFds_; : } : > Need to update these Done http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java File fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@133 PS10, Line 133: !fileDescriptors_.isEmpty() || > nit: It's inefficient to construct the list for this. Maybe change to Done http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@138 PS10, Line 138: > nit: It's inefficient to construct the list for this. Maybe change to Instead of the ternary operator, I rather just sum up the sizes to keep it more readable. http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45 PS9, Line 45: import org.apache.impala.analysis.NullLiteral; > Looks like IntelliJ is more intelligent in this :D Yeah :D It's a shame vscode fails on such a trivial task. But apart from this it's a great IDE. http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518 PS9, Line 1518: feTable.getMetaStoreTable().getParameters())); > PS10 looks good to me. I think the insert files and delete files are used s I'm OK with this if you think it's good. Yeah, I was also thinking about merging fileDescriptors_ and insertFileDescriptors_. My only concern is that it would be a bit weird for genDeleteDeltaPartition(), because it would put the delete delta descriptors into the regular descriptors. http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1536 PS10, Line 1536: hdfsTblRef.getDesc(), conjuncts, insertDeltaPartitions, hdfsTblRef, > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java File fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@497 PS10, Line 497: () { > not *Fail* anymore Done
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/16082/11/tests/custom_cluster/test_local_catalog.py File tests/custom_cluster/test_local_catalog.py: http://gerrit.cloudera.org:8080/#/c/16082/11/tests/custom_cluster/test_local_catalog.py@30 PS11, Line 30: from tests.common.skip import (SkipIfHive2, SkipIfCatalogV2, SkipIfS3, SkipIfABFS, flake8: F401 'tests.common.skip.SkipIfCatalogV2' imported but unused -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 11 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jul 2020 15:01:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#11). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M common/thrift/CatalogObjects.thrift M common/thrift/CatalogService.thrift M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/FeFsTable.java M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/custom_cluster/test_local_catalog.py M tests/query_test/test_acid.py 27 files changed, 1,460 insertions(+), 146 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/11 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 11 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 10: (9 comments) Thanks for making the change! I'm still looking into the join related logic and tests. http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java File fe/src/main/java/org/apache/impala/catalog/FeFsTable.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java@376 PS10, Line 376: getFilesSample Could you add a TODO in this method for IMPALA-9883? http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2010 PS10, Line 2010: getFileDescriptors().size() Please change this to getNumFileDescriptors() after we revise its implementation. http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java File fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@56 PS10, Line 56: import org.apache.orc.FileMetadata; nit: unused import http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@255 PS10, Line 255: nit: 4 spaces indent http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java File fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@133 PS10, Line 133: !getFileDescriptors().isEmpty() nit: It's inefficient to construct the list for this. Maybe change to return !fileDescriptors_.isEmpty() || !insertFileDescriptors_.isEmpty() || !deleteFileDescriptors_.isEmpty(). http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java@138 PS10, Line 138: getFileDescriptors().size() nit: It's inefficient to construct the list for this. Maybe change to return fileDescriptors_.isEmpty()? insertFileDescriptors_.size() + deleteFileDescriptors_.size() : fileDescriptors_.size(); http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45 PS9, Line 45: import org.apache.impala.analysis.NullLiteral; > Yeah, I use my IDE to do that for me. Interestingly, even if I delete this Looks like IntelliJ is more intelligent in this :D http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518 PS9, Line 1518: feTable.getMetaStoreTable().getParameters())); > I implemented it in PS10, but I'm afraid that the change became a bit too i PS10 looks good to me. I think the insert files and delete files are used separately so it makes sense to store them in different fileds. However, It'd be good to ask more feedbacks on this decision. Maybe we can merge fileDescriptors_ and insertFileDescriptors_ to one field (since only one of them is valid at a time) to reduce the complexity. http://gerrit.cloudera.org:8080/#/c/16082/10/tests/custom_cluster/test_local_catalog.py File tests/custom_cluster/test_local_catalog.py: http://gerrit.cloudera.org:8080/#/c/16082/10/tests/custom_cluster/test_local_catalog.py@467 PS10, Line 467: def test_full_acid_support(self): Could you add a test here for reading functional_orc_def.alltypes_deleted_rows in LocalCatalog mode? -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jul 2020 09:59:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 10: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 06 Jul 2020 21:16:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 10: (2 comments) http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java File fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java@409 PS10, Line 409: @Override : public ImmutableList getFileDescriptors() { : return fds_; : } : : @Override : public ImmutableList getInsertFileDescriptors() { : return fds_; : } : : @Override : public ImmutableList getDeleteFileDescriptors() { : return fds_; : } Need to update these http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java File fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java@497 PS10, Line 497: Fail not *Fail* anymore -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 06 Jul 2020 17:07:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 10: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6094/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 06 Jul 2020 16:09:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6499/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 06 Jul 2020 16:05:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45 PS9, Line 45: import org.apache.impala.analysis.Path.PathType; > nit: keep the import list sorted in groups (usually the IDE will do this fo Yeah, I use my IDE to do that for me. Interestingly, even if I delete this line, and ask VSCode to import PathType for me, it puts the import to this line again... Same with HdfsPartition.FileDescriptor. Seems like if an import path has more components it can confuse vscode. Anyway, fixed it manually :D http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1418 PS9, Line 1418: if (addAcidSlotsIfNeeded(analyzer, hdfsTblRef, partitions)) { > nit: what about merging this if-statement with its outer scope so they are Done http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518 PS9, Line 1518: // Let's separate insert delta File Descriptors from delete delta FDs. > I think we should separate the file descriptors in catalogd after loading t I implemented it in PS10, but I'm afraid that the change became a bit too invasive. An alternative approach is just to add the genInsertDeltaPartition() and genDeleteDeltaPartition() to the partition classes and those would filter out the unneeded file descs and return a new Partition object. This way the change would be smaller, but we'd still have to generate these insert/delta partitions for each query. What do you think? -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 06 Jul 2020 15:46:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/10/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1536 PS10, Line 1536: hdfsTblRef.getDesc(), conjuncts, insertDeltaPartitions, hdfsTblRef, /*aggInfo=*/null, line too long (93 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 06 Jul 2020 15:42:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Quanlong Huang, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#10). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M common/thrift/CatalogObjects.thrift M common/thrift/CatalogService.thrift M fe/src/main/java/org/apache/impala/catalog/FeCatalogUtils.java M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/FileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/main/java/org/apache/impala/catalog/ParallelFileMetadataLoader.java M fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsTable.java M fe/src/main/java/org/apache/impala/catalog/local/MetaProvider.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 25 files changed, 1,429 insertions(+), 142 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/10 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@45 PS9, Line 45: import org.apache.impala.analysis.Path.PathType; nit: keep the import list sorted in groups (usually the IDE will do this for you). http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1418 PS9, Line 1418: if (addAcidSlotsIfNeeded(analyzer, hdfsTblRef, partitions)) { nit: what about merging this if-statement with its outer scope so they are if (isPartitionKeyScan && queryOpts.optimize_partition_key_scans) { ... } else if (addAcidSlotsIfNeeded(analyzer, hdfsTblRef, partitions)) { ... } else { ... } http://gerrit.cloudera.org:8080/#/c/16082/9/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1518 PS9, Line 1518: // Let's separate insert delta File Descriptors from delete delta FDs. I think we should separate the file descriptors in catalogd after loading them from HDFS instead of doing it here for each query. We can introduce two fileds in HdfsPartition: encodedInsertDeltaFileDescriptors_ and encodedDeleteDeltaFileDescriptors_ (and related fields in THdfsPartition and TPartialPartitionInfo). If a partition contains delete deltas, we separate them by setting these two fields and leaving encodedFileDescriptors_ null. We can also introduce two methods for FeFsPartition: genInsertDeltaPartition() and genDeleteDeltaPartition() using HdfsPartition.Builder in this way: public HdfsPartition genInsertDeltaPartition() { return new HdfsPartition.Builder(this) .setFileDescriptors(InsertDeltaFileDescriptors_) .build(); } With this we don't need to remove the "final" marker of encodedFileDescriptors_ and add back the setFileDescriptor() method, which violates our goal to make HdfsPartition immutable. The setFileDescriptor() method may encourage future developers to modify HdfsPartitions in-place in catalogd, which will break IMPALA-7533. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 01 Jul 2020 06:57:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 9: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 30 Jun 2020 14:02:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6464/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 30 Jun 2020 09:28:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 9: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6078/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 30 Jun 2020 09:02:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 9: Code-Review+1 PS9 is a rebase. Carrying +1. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 30 Jun 2020 09:02:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#9). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 1,148 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/9 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6399/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 23 Jun 2020 16:28:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#8). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 1,148 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/8 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 8: Code-Review+1 (2 comments) Carrying Aman's +1 http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58 PS2, Line 58: | row-size=100B cardinality=143 > >> It's because COMPUTE STATS is query-based, so it runs count(*) and coun Thanks for mentioning stats extrapolation. I created IMPALA-9883 for that. Renamed DELETA TABLE to DELETE EVENTS. http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286 PS4, Line 286: | | hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year > I see..I missed that the other columns were partitioning cols. I would thin Added a TODO comment about it in SingleNodePlanner. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 23 Jun 2020 16:06:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 7: Code-Review+1 Couple of minor comments. Overall, this LGTM. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 22 Jun 2020 17:55:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58 PS2, Line 58: | row-size=100B cardinality=143 > Yeah I don't think we should pass the conjuncts to the delete deltas. Also >> It's because COMPUTE STATS is query-based, so it runs count(*) and >> count(distinct) queries on a table and stores the results of it. Makes sense..so Compute Stats is also doing the left anti join. Do you know if the sampling clause of Compute Stats will be supported for the ACID table ? i.e a 10% sample on the delta table would need to be anti-joined with the delete_delta. I suppose that should work. In any case, that's a separate investigation. Regarding the naming, yes, 'EVENTS' is more accurate than 'TABLE' since it is not really a table. http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286 PS4, Line 286: | | hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year > It only adds the hidden ACID columns + the partitioning columns (year, mont I see..I missed that the other columns were partitioning cols. I would think that not all the columns need to be in the 'hash predicate' .. as long as the ones with high NDV are considered for hashing, the rest could be part of the 'other join predicates' to avoid hashing on too many columns. But this could be a future enhancement. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 22 Jun 2020 17:54:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6390/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 22 Jun 2020 16:48:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 6: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/6389/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 22 Jun 2020 16:41:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 7: PS 6&7 are only rebases. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 22 Jun 2020 16:21:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#7). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 1,143 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/7 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#6). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 1,142 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/6 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 5: (6 comments) http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@9 PS4, Line 9: Hive ACID supports row-level DELETE and UPDATE operations on a table. > nit: spelling 'operatations' Done http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@11 PS4, Line 11: maintaining two sets of files in a table. The first set is in the > nit: spelling 'maintinaining' Done http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454 PS2, Line 1454: if (fd.getRelativePath().startsWith("delete_delta_")) { > I think for this patch it is fine to keep it this way. I saw in the doc th Yeah I agree. Will do that during perf analysis. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551 PS2, Line 1551: HdfsScanNode deltaScanNode = new HdfsScanNode(ctx_.getNextNodeId(), > Actually, the planner does this optimization as part of SanNode.applyCountS You are right, I was thinking about a different optimization in the ORC scanner which is also for count(*): https://github.com/apache/impala/blob/17fd15c6e4981499932c02d541c76757a5fdf87d/be/src/exec/hdfs-orc-scanner.cc#L531-L548 Anyway, now we have tests for count(*) + ACID so if someone implements the Parquet-kind count(*) optimization for ORC in the future, they'll know if they break stg. http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58 PS2, Line 58: | row-size=100B cardinality=143 > Regarding the conjuncts, wouldn't it be incorrect to even pass the conjunc Yeah I don't think we should pass the conjuncts to the delete deltas. Also it wouldn't have any effect since the delete delta files contain no real data, only the ACID fields (originaltransaction, rowid, etc.) which serve as a tombstone. To answer the question about cardinality, yes, COMPUTE STATS would show 90. It's because COMPUTE STATS is query-based, so it runs count(*) and count(distinct) queries on a table and stores the results of it. And the "DELETE TABLE" is not a separate table. The delete files are stored under the table's directory in so called "delete delta" directories next to the "delta" directories, e.g.: /test-warehouse/managed/full_acid_part/p=1/delete_delta_003_003_/bucket_0 /test-warehouse/managed/full_acid_part/p=1/delta_001_001_/bucket_0_0 Given that do you think we should rename "DELETE TABLE HASH JOIN" to "DELETE EVENTS HASH JOIN"? (The Hive docs usually refer to the rows in delete delta files as "delete events") http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286 PS4, Line 286: | | hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year > This looks like it is adding all the other Slots to hash predicates, apart It only adds the hidden ACID columns + the partitioning columns (year, month). -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 22 Jun
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#5). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operations on a table. It achieves it via assigning a unique row-id for each row, and maintaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 1,147 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/5 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@9 PS4, Line 9: Hive ACID supports row-level DELETE and UPDATE operatations on a table. nit: spelling 'operatations' http://gerrit.cloudera.org:8080/#/c/16082/4//COMMIT_MSG@11 PS4, Line 11: maintinaining two sets of files in a table. The first set is in the nit: spelling 'maintinaining' -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Jun 2020 19:12:43 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 4: (6 comments) http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java File fe/src/main/java/org/apache/impala/planner/JoinNode.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@88 PS2, Line 88: displayName_ = "DELETE TABLE " + displayName_; > Yeah, I just wanted to somehow mark this JOIN to make it obvious what's goi Sounds good. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454 PS2, Line 1454: if (fd.getRelativePath().startsWith("delete_delta_")) { > Unfortunately there's no such metadata AFAICT. It could be added during dat I think for this patch it is fine to keep it this way. I saw in the doc that there's a performance analysis phase ..perhaps it could be investigated/improved in that phase. Since this extra cost is added during planning time which typically is in hundreds of ms, It would be interesting to see for typical workloads what is the percentage change in planning time. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1480 PS2, Line 1480: rawPath.add("row__id"); > It's needed and it's added by the above for loop at L1474-L1478. ah, i see..thanks. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551 PS2, Line 1551: HdfsScanNode deltaScanNode = new HdfsScanNode(ctx_.getNextNodeId(), > Yeah, this makes sense. Actually, the planner does this optimization as part of SanNode.applyCountStarOptimization() which is triggered for Parquet file formats currently..so I suppose in the ORC case it doesn't even matter (although I could foresee doing such optimization for ORC as well). http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58 PS2, Line 58: | row-size=100B cardinality=143 > Maybe it's not that wrong. If I use compute stats then the actual cardinali Regarding the conjuncts, wouldn't it be incorrect to even pass the conjuncts to the delete delta scan ? we don't want to prune out deleted records too early. For the cardinality, do you mean if the original cardinality was 100 and 10 rows were deleted, does COMPUTE STATS show 90 ? Even if no compaction was done ? http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/4/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@286 PS4, Line 286: | | hash predicates: functional_orc_def.alltypes_deleted_rows.month = functional_orc_def.alltypes_deleted_rows-delete-delta.month, functional_orc_def.alltypes_deleted_rows.row__id.bucket = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.bucket, functional_orc_def.alltypes_deleted_rows.row__id.originaltransaction = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.originaltransaction, functional_orc_def.alltypes_deleted_rows.row__id.rowid = functional_orc_def.alltypes_deleted_rows-delete-delta.row__id.rowid, functional_orc_def.alltypes_deleted_rows.year = functional_orc_def.alltypes_deleted_rows-delete-delta.year This looks like it is adding all the other Slots to hash predicates, apart from the hidden slots. Isn't that wrong ? My understanding was the anti join should only join on the ACID specific columns. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Jun 2020 01:27:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6371/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Jun 2020 17:05:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 4: (11 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@673 PS2, Line 673: fileFormatDescriptor_ = other.fileFormatDescriptor_; > Looking at the changes in https://issues.apache.org/jira/browse/IMPALA-9778 Resolved it during the rebase (PS3). http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@178 PS2, Line 178: if (!isAcidJoin_ || detailLevel.ordinal() >= TExplainLevel.EXTENDED.ordinal()) { > For the anti join should we not show the join conditions in the Impala quer Done http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java File fe/src/main/java/org/apache/impala/planner/JoinNode.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@83 PS2, Line 83: // True if this join is used to do the join between insert and delete delta files. > nit: spelling 'deleta' Done http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@88 PS2, Line 88: displayName_ = "DELETE TABLE " + displayName_; > The 'ACID' prefix for a HashJoin could be misleading to end users since a Yeah, I just wanted to somehow mark this JOIN to make it obvious what's going on, but probably 'ACID' is not the best term to use. I think 'DELETE TABLE HASH JOIN' makes sense. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453 PS2, Line 1453: for (HdfsPartition.FileDescriptor fd : partition.getFileDescriptors()) { > The casting is redundant since 'partition' is already of type FeFsPartitio Done http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454 PS2, Line 1454: if (fd.getRelativePath().startsWith("delete_delta_")) { > Is there any indicator in the HMS's partition object that indicates whether Unfortunately there's no such metadata AFAICT. It could be added during data loading, though it'd complicate the code a little, especially when local catalog is being used since it reuses loaded partitions while trying to filter out file descriptors based on an ACID write id list, see IMPALA-9791 for more details. Also, I did some measurements. Created a table with 1000 partitions and 10 files per each partition. Then I've nested this for-loop into another for-loop with 100 iterations to simulate a table with 100K partitions and 1 million files. This whole check was evaluated between 100-200 milliseconds. So for now I'm leaning towards keeping it simple unless you feel strongly against it. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1480 PS2, Line 1480: rawPath.add("row__id"); > The design doc suggests the anti-join is on 4 fields..is the partition id n It's needed and it's added by the above for loop at L1474-L1478. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551 PS2, Line 1551: HdfsScanNode deltaScanNode = new HdfsScanNode(ctx_.getNextNodeId(), > The 'aggInfo' is meant for the simple 'select count(*) from table' type o Yeah, this makes sense. It didn't cause a bug because the backend only optimizes count(*) if it's a "zero-slot table scan", and we've added ACID slots to materialize. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1591 PS2, Line 1591: insertSlotDesc.getMaterializedPath())) { > Call analyze for the BinaryPredicate. Also, could you add a simple test wi Done. Added planner tests for explain_level 2 and 3. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1605 PS2, Line 1605: /** > nit: spelling 'partitoin' Done http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#4). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operatations on a table. It achieves it via assigning a unique row-id for each row, and maintinaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 1,147 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/4 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6365/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Jun 2020 12:50:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 3: PS3 is only a rebase -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Jun 2020 12:28:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Aman Sinha, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#3). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operatations on a table. It achieves it via assigning a unique row-id for each row, and maintinaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 745 insertions(+), 89 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/3 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 2: (11 comments) Did a first pass..some comments below. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java@673 PS2, Line 673: protected HdfsPartition(HdfsPartition other) { Looking at the changes in https://issues.apache.org/jira/browse/IMPALA-9778, the HdfsPartition class has been restructured quite a bit.. although it seems your changes should be ok since you are only creating a clone. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java File fe/src/main/java/org/apache/impala/planner/HashJoinNode.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/HashJoinNode.java@178 PS2, Line 178: if (!isAcidJoin_ || detailLevel.ordinal() == TExplainLevel.VERBOSE.ordinal()) { For the anti join should we not show the join conditions in the Impala query profile ? (query profile output is in EXTENDED mode, not VERBOSE). http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java File fe/src/main/java/org/apache/impala/planner/JoinNode.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@83 PS2, Line 83: // True if this join is used to do the join between insert and deleta delta files. nit: spelling 'deleta' http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/JoinNode.java@88 PS2, Line 88: displayName_ = "ACID " + displayName_; The 'ACID' prefix for a HashJoin could be misleading to end users since a plan operator itself is not ACID in DBMS terms. Could we use something like. 'DELTA TABLE HASH JOIN' (open to other suggestions). http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1453 PS2, Line 1453: FeFsPartition fsPartition = (FeFsPartition)partition; The casting is redundant since 'partition' is already of type FeFsPartition. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1454 PS2, Line 1454: for (HdfsPartition.FileDescriptor fd : fsPartition.getFileDescriptors()) { Is there any indicator in the HMS's partition object that indicates whether any delete was done ? It does seem expensive to check all the file descriptors within a partition since in vast majority of cases nothing may have been deleted or the 'delete_delta' is found only towards the end. However, if such metadata does not exist, then I suppose this is the only alternative. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1480 PS2, Line 1480: String[] acidFields = {"originaltransaction", "bucket", "rowid"}; The design doc suggests the anti-join is on 4 fields..is the partition id not needed in this join ? http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1551 PS2, Line 1551: hdfsTblRef.getDesc(), conjuncts, insertPartitions, hdfsTblRef, aggInfo, The 'aggInfo' is meant for the simple 'select count(*) from table' type of query where the aggregation value can be found from the row count in the metadata. But in this case, since we want to compute this value only after the anti-join has been done, I think it should be null. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1591 PS2, Line 1591: ret.add(new BinaryPredicate( Call analyze for the BinaryPredicate. Also, could you add a simple test with explain_level=3 to check for the join condition ? I don't think I saw one. http://gerrit.cloudera.org:8080/#/c/16082/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1605 PS2, Line 1605:* @return the cloned partitoin nit: spelling 'partitoin' http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test: http://gerrit.cloudera.org:8080/#/c/16082/2/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@58 PS2, Line 58: | row-size=100B cardinality=143 The cardinality estimate for this hash join
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6350/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 17 Jun 2020 16:16:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16082 to look at the new patch set (#2). Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operatations on a table. It achieves it via assigning a unique row-id for each row, and maintinaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/FeFsPartition.java M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/catalog/local/LocalFsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 15 files changed, 733 insertions(+), 88 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/2 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16082 ) Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6329/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 15 Jun 2020 17:36:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types)
Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16082 Change subject: IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) .. IMPALA-9859: Full ACID Milestone 4: Part 1 Reading modified tables (primitive types) Hive ACID supports row-level DELETE and UPDATE operatations on a table. It achieves it via assigning a unique row-id for each row, and maintinaining two sets of files in a table. The first set is in the base/delta directories, they contain the INSERTed rows. The second set of files are in the delete-delta directories, they contain the DELETEd rows. (UPDATE operations are implemented via DELETE+INSERT.) In the filesystem it looks like e.g.: * full_acid/delta_001_001_/_0 * full_acid/delta_002_002_/_0 * full_acid/delete_delta_003_003_/_0 During scanning we need to return INSERTed rows minus DELETEd rows. This patch implements it by creating an ANTI JOIN between the INSERT and DELETE sets. It is a planner-only modification. Every HDFS SCAN that scans full ACID tables (that also have deleted rows) are converted to two HDFS SCANs, one for the INSERT deltas, and one for the DELETE deltas. Then a LEFT ANTI HASH JOIN with BROADCAST distribution mode is created above them. Later we can add support for other distribution modes if the performance requires it. E.g. if we have too many deleted rows then probably we are better off with PARTITIONED distribution mode. We could estimate the number of deleted rows by sampling the delete delta files. The current patch only works for primitive types. I.e. we cannot select nested data if the table has deleted rows. Testing: * added planner test * added e2e tests Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 --- M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java M fe/src/main/java/org/apache/impala/planner/HashJoinNode.java M fe/src/main/java/org/apache/impala/planner/JoinNode.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/util/AcidUtils.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-scans.test M tests/query_test/test_acid.py 13 files changed, 708 insertions(+), 88 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/16082/1 -- To view, visit http://gerrit.cloudera.org:8080/16082 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I15c8feabf40be1658f3dd46883f5a1b2aa5d0659 Gerrit-Change-Number: 16082 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy