[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Reviewed-on: http://gerrit.cloudera.org:8080/4120 Reviewed-by: Matthew Jacobs Tested-by: Internal Jenkins --- M be/src/exec/kudu-scan-node-test.cc M be/src/exec/kudu-scan-node.cc M be/src/exec/kudu-scan-node.h M be/src/exec/kudu-scanner.cc M be/src/exec/kudu-scanner.h M be/src/exec/kudu-util.cc M be/src/exec/kudu-util.h M be/src/scheduling/simple-scheduler.cc M bin/impala-config.sh M common/thrift/PlanNodes.thrift M fe/pom.xml M fe/src/main/java/com/cloudera/impala/analysis/Expr.java M fe/src/main/java/com/cloudera/impala/analysis/LiteralExpr.java M fe/src/main/java/com/cloudera/impala/catalog/KuduTable.java M fe/src/main/java/com/cloudera/impala/catalog/delegates/KuduDdlDelegate.java M fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java M fe/src/main/java/com/cloudera/impala/util/KuduUtil.java D fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTest.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java M fe/src/test/java/com/cloudera/impala/util/KuduUtilTest.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test 25 files changed, 526 insertions(+), 1,189 deletions(-) Approvals: Matthew Jacobs: Looks good to me, approved Internal Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Internal Jenkins has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 6: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 6: Code-Review+2 rebase, carrying marcel's +2 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Hello Dan Burkert, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4120 to look at the new patch set (#5). Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 --- M be/src/exec/kudu-scan-node-test.cc M be/src/exec/kudu-scan-node.cc M be/src/exec/kudu-scan-node.h M be/src/exec/kudu-scanner.cc M be/src/exec/kudu-scanner.h M be/src/exec/kudu-util.cc M be/src/exec/kudu-util.h M be/src/scheduling/simple-scheduler.cc M bin/impala-config.sh M common/thrift/PlanNodes.thrift M fe/pom.xml M fe/src/main/java/com/cloudera/impala/analysis/Expr.java M fe/src/main/java/com/cloudera/impala/analysis/LiteralExpr.java M fe/src/main/java/com/cloudera/impala/catalog/KuduTable.java M fe/src/main/java/com/cloudera/impala/catalog/delegates/KuduDdlDelegate.java M fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java M fe/src/main/java/com/cloudera/impala/util/KuduUtil.java D fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTest.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java M fe/src/test/java/com/cloudera/impala/util/KuduUtilTest.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test 25 files changed, 526 insertions(+), 1,189 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/4120/5 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has posted comments on this change.
Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges
..
Patch Set 4:
(2 comments)
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node-test.cc
File be/src/exec/kudu-scan-node-test.cc:
Line 53: class KuduScanNodeTest : public testing::Test {
> good idea, let's leave a todo.
Done. Makes sense.
http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java
File fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java:
Line 33: public class KuduPlannerTest extends PlannerTestBase {
> is this very slow?
Shouldn't be. I'll move them into PlannerTest.
--
To view, visit http://gerrit.cloudera.org:8080/4120
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: Dimitris Tsirogiannis
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Matthew Jacobs
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Marcel Kornacker has posted comments on this change.
Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges
..
Patch Set 4:
(3 comments)
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node-test.cc
File be/src/exec/kudu-scan-node-test.cc:
Line 53: class KuduScanNodeTest : public testing::Test {
> I can't say since I didn't add these to begin with, but while we're in the
good idea, let's leave a todo.
the reason i don't like white-box testing is that you end up writing a lot of
test harness code, which can have bugs, and you might end up testing paths
that aren't taken during actual query execution.
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:
Line 114: if (batch_done) break;
> To avoid returning Status::OK() again, and in case logic gets added after t
no, please leave as is.
http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java
File fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java:
Line 33: public class KuduPlannerTest extends PlannerTestBase {
> I don't know what the original reason was, but now that it's separate it's
is this very slow?
the planner tests typically take something like 40s to run, a few seconds on
top won't make a difference. also, it's good to have everything in one place,
so i don't need to remember to run two planner tests when making planner
changes.
--
To view, visit http://gerrit.cloudera.org:8080/4120
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: Dimitris Tsirogiannis
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Matthew Jacobs
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has posted comments on this change.
Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges
..
Patch Set 4:
(17 comments)
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node-test.cc
File be/src/exec/kudu-scan-node-test.cc:
Line 53: class KuduScanNodeTest : public testing::Test {
> we don't have unit tests for any other exec node. why is this an exception?
I can't say since I didn't add these to begin with, but while we're in the
process of getting our end-to-end tests up it doesn't hurt to have some of this
coverage. I think once our functional tests are in place this can be removed.
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:
Line 247: VLOG(1) << "Thread started: " << name;
> 1 is too chatty. also, please use the macros.
Done
Line 328: VLOG(1) << "Thread done: " << name;
> change
Done
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node.h
File be/src/exec/kudu-scan-node.h:
Line 39: /// are used to retreive the rows for this scan.
> retrieve
Done
Line 120: void ScannerThread(const string& name, const string* initial_token);
> rename to some verb
Done
Line 125: Status ProcessScanToken(KuduScanner* scanner, const string*
scan_token);
> why const*? can this be null?
no, thanks, this should be const&
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:
Line 114: if (batch_done) break;
> why break instead of return?
To avoid returning Status::OK() again, and in case logic gets added after the
while() loop. If you think it's simpler returning OK() here I'm happy to change
it.
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scanner.h
File be/src/exec/kudu-scanner.h:
Line 34: /// Wraps a Kudu client scanner to fetches row batches from Kudu. The
Kudu client scanner
> to fetch
Done
Line 114: int cur_kudu_batch_num_scanned_;
> explain
Done
http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java
File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java:
Line 107: try (KuduClient client = new KuduClientBuilder(
> break before 'new' instead
Done
Line 165: // TODO: Consider only using the LEADER replica.
> for better cache locality?
I was thinking it might be better but I don't know if there's any reason that
might be the case. I'll take out this comment unless and we can chat about the
scheduling policies later.
Line 167: TNetworkAddress address = new
TNetworkAddress(replica.getRpcHost(),
> break before new instead
Done
Line 203: for (KuduPredicate predicate: kuduPredicates_) {
> single line
Done
Line 267:* bounds of the result can be evaluated with Expr::GetValue(NULL)).
> this presumable has some side effects. describe them.
Done
Line 349: private static ComparisonOp
getKuduOperator(BinaryPredicate.Operator op) {
> use qualified name for ComparisonOp, easier to identify that it's a "foreig
Done
Line 351: case GT: return ComparisonOp.GREATER;
> indentation
Done
http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java
File fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java:
Line 33: public class KuduPlannerTest extends PlannerTestBase {
> could we merge this back into the regular planner test? what was the reason
I don't know what the original reason was, but now that it's separate it's kind
of handy for creating/destroying the KuduClient only for these tests.
--
To view, visit http://gerrit.cloudera.org:8080/4120
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: Dimitris Tsirogiannis
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Matthew Jacobs
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java: Line 131:* TODO: Remove when the columns are fetched from Kudu directly during analysis. > you're not going to avoid a round trip to the kudu master to get tablet loc those can all be addressed separately, but i don't see a reason to ditch functionality we already have. it's always harder to put it back later. and it would be a special case for kudu anyway. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Todd Lipcon has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java: Line 139: tableSchema.getColumn(colName); > or why not simply have getColumn() return null? that would have been a reasonable decision too, but now that we've released it, changing it is an incompatible API change. In general we try to make our Java API JDBC-like since most enterprise Java programmers are reasonably familiar with JDBC. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Todd Lipcon has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java: Line 131:* TODO: Remove when the columns are fetched from Kudu directly during analysis. > one of our design principles is to avoid round-trips to central servers dur you're not going to avoid a round trip to the kudu master to get tablet locations anyway... and if there's some adversarial workload that's bringing the master to its knees, then your backends will just hit the same issue when they try to open scanners (they also need to talk to the master before talking to a tablet) -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Marcel Kornacker has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java: Line 131:* TODO: Remove when the columns are fetched from Kudu directly during analysis. > when you "open" a table, it fetches the metadata (columns, etc) from the ma one of our design principles is to avoid round-trips to central servers during plan generation. (i believe your test cluster numbers, but i also believe it's not that hard to create an adversarial workload.) let's remove this todo. Line 139: tableSchema.getColumn(colName); > We're more or less following the pattern of JDBC where trying to get an inv or why not simply have getColumn() return null? we have no special allegiance to jdbc. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Todd Lipcon has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 4: (4 comments) http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/scheduling/simple-scheduler.cc File be/src/scheduling/simple-scheduler.cc: Line 976: scan_range_length = 1000; > can't the fe get that out of the scan token? We still don't have the tablet sizes available to the client yet, it's a TODO item (see jira referenced above) http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java: Line 131:* TODO: Remove when the columns are fetched from Kudu directly during analysis. > does kudu cache all metadata? where would the column info get fetched from? when you "open" a table, it fetches the metadata (columns, etc) from the master. This isn't a fan-out so it should have low tail latency. On one of our test clusters, 99%ile is 492us, mean 84us. Line 139: tableSchema.getColumn(colName); > note to kudu api designers: it's nice not to use exceptions to signal retur We're more or less following the pattern of JDBC where trying to get an invalid column throws an exception rather than returning null (see https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSetMetaData.java#L487 ) That said, maybe catching IllegalArgumentException would be a better choice? We could also add a new 'bool hasColumn(String name)' if you'd like. Line 295: // Cannot push prediates with null literal values > is there a kudu jira for this? there is now: KUDU-1595 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Marcel Kornacker has posted comments on this change.
Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges
..
Patch Set 4:
(21 comments)
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node-test.cc
File be/src/exec/kudu-scan-node-test.cc:
Line 53: class KuduScanNodeTest : public testing::Test {
we don't have unit tests for any other exec node. why is this an exception?
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:
Line 247: VLOG(1) << "Thread started: " << name;
1 is too chatty. also, please use the macros.
Line 328: VLOG(1) << "Thread done: " << name;
change
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scan-node.h
File be/src/exec/kudu-scan-node.h:
Line 39: /// are used to retreive the rows for this scan.
retrieve
Line 120: void ScannerThread(const string& name, const string* initial_token);
rename to some verb
Line 125: Status ProcessScanToken(KuduScanner* scanner, const string*
scan_token);
why const*? can this be null?
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:
Line 114: if (batch_done) break;
why break instead of return?
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/exec/kudu-scanner.h
File be/src/exec/kudu-scanner.h:
Line 34: /// Wraps a Kudu client scanner to fetches row batches from Kudu. The
Kudu client scanner
to fetch
Line 114: int cur_kudu_batch_num_scanned_;
explain
http://gerrit.cloudera.org:8080/#/c/4120/4/be/src/scheduling/simple-scheduler.cc
File be/src/scheduling/simple-scheduler.cc:
Line 976: scan_range_length = 1000;
can't the fe get that out of the scan token?
http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java
File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java:
Line 107: try (KuduClient client = new KuduClientBuilder(
break before 'new' instead
Line 131:* TODO: Remove when the columns are fetched from Kudu directly
during analysis.
does kudu cache all metadata? where would the column info get fetched from?
Line 139: tableSchema.getColumn(colName);
note to kudu api designers: it's nice not to use exceptions to signal return
values.
is that a sufficient test? what about data types?
Line 165: // TODO: Consider only using the LEADER replica.
for better cache locality?
Line 167: TNetworkAddress address = new
TNetworkAddress(replica.getRpcHost(),
break before new instead
Line 203: for (KuduPredicate predicate: kuduPredicates_) {
single line
Line 267:* bounds of the result can be evaluated with Expr::GetValue(NULL)).
this presumable has some side effects. describe them.
Line 295: // Cannot push prediates with null literal values
is there a kudu jira for this?
Line 349: private static ComparisonOp
getKuduOperator(BinaryPredicate.Operator op) {
use qualified name for ComparisonOp, easier to identify that it's a "foreign"
type that way
Line 351: case GT: return ComparisonOp.GREATER;
indentation
http://gerrit.cloudera.org:8080/#/c/4120/4/fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java
File fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java:
Line 33: public class KuduPlannerTest extends PlannerTestBase {
could we merge this back into the regular planner test? what was the reason for
separating this?
--
To view, visit http://gerrit.cloudera.org:8080/4120
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: Dimitris Tsirogiannis
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Matthew Jacobs
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Dan Burkert has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 4: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has uploaded a new patch set (#4). Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 --- M be/src/exec/kudu-scan-node-test.cc M be/src/exec/kudu-scan-node.cc M be/src/exec/kudu-scan-node.h M be/src/exec/kudu-scanner.cc M be/src/exec/kudu-scanner.h M be/src/exec/kudu-util.cc M be/src/exec/kudu-util.h M be/src/scheduling/simple-scheduler.cc M bin/impala-config.sh M common/thrift/PlanNodes.thrift M fe/pom.xml M fe/src/main/java/com/cloudera/impala/analysis/Expr.java M fe/src/main/java/com/cloudera/impala/analysis/LiteralExpr.java M fe/src/main/java/com/cloudera/impala/catalog/KuduTable.java M fe/src/main/java/com/cloudera/impala/catalog/delegates/KuduDdlDelegate.java M fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java M fe/src/main/java/com/cloudera/impala/util/KuduUtil.java M fe/src/test/java/com/cloudera/impala/planner/KuduPlannerTest.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java M fe/src/test/java/com/cloudera/impala/util/KuduUtilTest.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test 24 files changed, 521 insertions(+), 1,137 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/4120/4 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has uploaded a new patch set (#3). Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 --- M be/src/exec/kudu-scan-node-test.cc M be/src/exec/kudu-scan-node.cc M be/src/exec/kudu-scan-node.h M be/src/exec/kudu-scanner.cc M be/src/exec/kudu-scanner.h M be/src/exec/kudu-util.cc M be/src/exec/kudu-util.h M be/src/scheduling/simple-scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/com/cloudera/impala/analysis/Expr.java M fe/src/main/java/com/cloudera/impala/analysis/LiteralExpr.java M fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test 17 files changed, 457 insertions(+), 1,088 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/4120/3 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 2: BTW, I think this actually has reasonable test coverage with the existing tests to get started. I think it makes sense to follow-up with additional tests for predicate pushdown - once more of the functional schema is loaded (alltypes*). I'll add new tests to demonstrate pruning once KUDU-1065 is resolved. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has uploaded a new patch set (#2). Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 --- M be/src/exec/kudu-scan-node-test.cc M be/src/exec/kudu-scan-node.cc M be/src/exec/kudu-scan-node.h M be/src/exec/kudu-scanner.cc M be/src/exec/kudu-scanner.h M be/src/exec/kudu-util.cc M be/src/exec/kudu-util.h M be/src/scheduling/simple-scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/com/cloudera/impala/analysis/Expr.java M fe/src/main/java/com/cloudera/impala/analysis/LiteralExpr.java M fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test 17 files changed, 434 insertions(+), 1,080 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/4120/2 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has posted comments on this change.
Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges
..
Patch Set 1:
(13 comments)
Thanks for the feedback. I made changes/follow-up comments for everything
except for how we do the plan output printing, since Dan is going to provide a
printing fn for us.
http://gerrit.cloudera.org:8080/#/c/4120/1/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:
PS1, Line 225: range
> GetNextScanToken returns a ... "range" :). I think we should clean this a b
Done
http://gerrit.cloudera.org:8080/#/c/4120/1/be/src/exec/kudu-scan-node.h
File be/src/exec/kudu-scan-node.h:
PS1, Line 35: tablet
> Hm, maybe I am confused but a server can have multiple tablets that need to
Done
PS1, Line 76: string
> not sure if y'all follow the same style guide, but this should be std::stri
Done
Line 79: int next_scan_token_idx_;
> I may have missed something, but it looks like you could just use scan_toke
I started changing this, but I think it's nice to just be passing string*
around as the tokens. I could still use this as a stack if I store them
somewhere else (e.g. a mempool), but not sure if it's worth it.
PS1, Line 125: ranges
> tokens and ranges are sort of used interchangeably. Maybe we should keep on
Done
http://gerrit.cloudera.org:8080/#/c/4120/1/be/src/exec/kudu-scanner.h
File be/src/exec/kudu-scanner.h:
PS1, Line 43: 'range'
> update comment?
Done
http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java
File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java:
PS1, Line 64: (<=, >=, ==)
> Comment needs to be updated.
Done
PS1, Line 77: // TODO use the portion of this list that pertains to keys to do
partition pruning.
> I think we can remove this TODO now, no?
Done
Line 95: kuduTable_.getKuduMasterAddresses()).build()) {
> nit: indentation is a bit weird, maybe it's Impala style though
Done
PS1, Line 133: e master
> I think you mean LEADER replica
Done
PS1, Line 233: <=, >=, ==
> Update comment
Done
PS1, Line 269: column
> has it already been validated that the slot refs only refer to columns that
Yeah, random exceptions might not come back nicely. I added a validate method
which happens before this and throws a more specific error.
PS1, Line 264: if (!(predicate.getChild(0) instanceof SlotRef)) return null;
: if (!predicate.getChild(1).isLiteral()) return null;
:
: SlotRef ref = (SlotRef)predicate.getChild(0);
: String colName = ref.getDesc().getColumn().getName();
: ColumnSchema column = table.getSchema().getColumn(colName);
: ComparisonOp op =
getKuduOperator(((BinaryPredicate)predicate).getOp());
: if (op == null) return null;
> How about putting all the logic of checking if a predicate can be pushed to
Done
--
To view, visit http://gerrit.cloudera.org:8080/4120
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs
Gerrit-Reviewer: Dan Burkert
Gerrit-Reviewer: Dimitris Tsirogiannis
Gerrit-Reviewer: Matthew Jacobs
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java File fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java: Line 297: result.append(Arrays.toString(upperBound.toByteArray())); > You can build against a SNAPSHOT java client, right? In that case we can ge Yes we can, and that would be great. Lemme know if you want me to look at a CR. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Todd Lipcon has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java File fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java: Line 297: result.append(Arrays.toString(upperBound.toByteArray())); > Yes, a toString() method would be very handy. Shall I file a JIRA? In the m You can build against a SNAPSHOT java client, right? In that case we can get you an improved API with really fast turnaround. Would rather do that (eg tomorrow) than check in usage of private/shaded APIs. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 1: (3 comments) Thanks for the feedback. Just responding to the plan info issue now. Will fix everything else later. http://gerrit.cloudera.org:8080/#/c/4120/1//COMMIT_MSG Commit Message: Line 36: work), add TPC-DS tests that demonstrate pruning. > I think the pruning isn't quite working on the Kudu side quite yet. You can Ah ok, thanks. http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java File fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java: PS1, Line 281: KUDU KEYRANGE > Moreover, I think the important thing here isn't the partition key range, b @Dimitris, yup. @Todd/Dan, yeah, we'd ideally like to get a sense of what data is going to be scanned. I can change the name for now. Line 297: result.append(Arrays.toString(upperBound.toByteArray())); > maybe we should provide you a nicer toString() type method you can use? In Yes, a toString() method would be very handy. Shall I file a JIRA? In the meantime I guess we can keep this though, assuming we switch over to using a new toString() method or such before we ship- seem reasonable? We just need some way to verify the plans. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Dan Burkert has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 1: (3 comments) ScanToken usage LGTM. I agree with Todd that we should provide some more debug printing tools if you all need it. http://gerrit.cloudera.org:8080/#/c/4120/1/be/src/exec/kudu-scan-node.h File be/src/exec/kudu-scan-node.h: PS1, Line 76: string not sure if y'all follow the same style guide, but this should be std::string here and below. Line 79: int next_scan_token_idx_; I may have missed something, but it looks like you could just use scan_tokens_ as a stack and pop tokens off it for each call to GetNextScanToken. http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java File fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java: PS1, Line 281: KUDU KEYRANGE > KEYRANGE isn't really accurate -- this makes it seem like it's a primary ke Moreover, I think the important thing here isn't the partition key range, but what tablet it's scanning. I recommend just printing the tablet ID. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Todd Lipcon has posted comments on this change.
Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges
..
Patch Set 1:
(7 comments)
http://gerrit.cloudera.org:8080/#/c/4120/1//COMMIT_MSG
Commit Message:
Line 36: work), add TPC-DS tests that demonstrate pruning.
I think the pruning isn't quite working on the Kudu side quite yet. You can
follow https://issues.apache.org/jira/browse/KUDU-1065
http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java
File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java:
Line 95: kuduTable_.getKuduMasterAddresses()).build()) {
nit: indentation is a bit weird, maybe it's Impala style though
PS1, Line 133: e master
I think you mean LEADER replica
PS1, Line 269: column
has it already been validated that the slot refs only refer to columns that
exist? eg if the metastore has some outdated information about the columns,
this will throw an IllegalArgumentException. Would that propagate up to the
user in a reasonable way?
http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java
File fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java:
PS1, Line 39: import org.kududb.client.Client.ScanTokenPB;
: import org.kududb.client.shaded.com.google.protobuf.ByteString;
: import
org.kududb.client.shaded.com.google.protobuf.InvalidProtocolBufferException;
yea, this is pretty evil, there's no compatibility guarantee about these APIs
so would strongly discourage using them
PS1, Line 281: KUDU KEYRANGE
> Is this part used in any of the planner tests? I don't remember seeing it a
KEYRANGE isn't really accurate -- this makes it seem like it's a primary key
range, where in fact it's a partition key range (which may not be one and the
same)
Line 297: result.append(Arrays.toString(upperBound.toByteArray()));
maybe we should provide you a nicer toString() type method you can use? In
addition to being an evil use of our internal APIs, this isn't particularly
useful for users to look at. You're just going to see something like
[0,0,0,13,5,2,4,3,...] which is hardly friendly.
--
To view, visit http://gerrit.cloudera.org:8080/4120
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: comment
Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs
Gerrit-Reviewer: Dimitris Tsirogiannis
Gerrit-Reviewer: Todd Lipcon
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Dimitris Tsirogiannis has posted comments on this change. Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. Patch Set 1: (9 comments) Overall it looks good. At a high level, 'ranges' and 'tokens' seem to be used interchangeably which is a bit confusing. Also, I think we need more planner tests that exercise the logic of pushing predicates to Kudu. http://gerrit.cloudera.org:8080/#/c/4120/1/be/src/exec/kudu-scan-node.cc File be/src/exec/kudu-scan-node.cc: PS1, Line 225: range GetNextScanToken returns a ... "range" :). I think we should clean this a bit. http://gerrit.cloudera.org:8080/#/c/4120/1/be/src/exec/kudu-scan-node.h File be/src/exec/kudu-scan-node.h: PS1, Line 35: tablet Hm, maybe I am confused but a server can have multiple tablets that need to be scanned, correct? Don't we create one scanner per node? PS1, Line 125: ranges tokens and ranges are sort of used interchangeably. Maybe we should keep one? http://gerrit.cloudera.org:8080/#/c/4120/1/be/src/exec/kudu-scanner.h File be/src/exec/kudu-scanner.h: PS1, Line 43: 'range' update comment? http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java File fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java: PS1, Line 64: (<=, >=, ==) Comment needs to be updated. PS1, Line 77: // TODO use the portion of this list that pertains to keys to do partition pruning. I think we can remove this TODO now, no? PS1, Line 233: <=, >=, == Update comment PS1, Line 264: if (!(predicate.getChild(0) instanceof SlotRef)) return null; : if (!predicate.getChild(1).isLiteral()) return null; : : SlotRef ref = (SlotRef)predicate.getChild(0); : String colName = ref.getDesc().getColumn().getName(); : ColumnSchema column = table.getSchema().getColumn(colName); : ComparisonOp op = getKuduOperator(((BinaryPredicate)predicate).getOp()); : if (op == null) return null; How about putting all the logic of checking if a predicate can be pushed to Kudu into a single function? That would include this block and also some parts from extractKuduConjuncts() fn (e.g. L242-247). http://gerrit.cloudera.org:8080/#/c/4120/1/fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java File fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java: PS1, Line 281: KUDU KEYRANGE Is this part used in any of the planner tests? I don't remember seeing it anywhere. -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs Gerrit-Reviewer: Dimitris Tsirogiannis Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-3481: Use Kudu ScanToken API for scan ranges
Matthew Jacobs has uploaded a new change for review. http://gerrit.cloudera.org:8080/4120 Change subject: IMPALA-3481: Use Kudu ScanToken API for scan ranges .. IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned, pruning the tablets that will not be necessary. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When the TPC-DS schema is loaded for Kudu (follow-up work), add TPC-DS tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 --- M be/src/exec/kudu-scan-node-test.cc M be/src/exec/kudu-scan-node.cc M be/src/exec/kudu-scan-node.h M be/src/exec/kudu-scanner.cc M be/src/exec/kudu-scanner.h M be/src/exec/kudu-util.cc M be/src/exec/kudu-util.h M be/src/scheduling/simple-scheduler.cc M common/thrift/PlanNodes.thrift M fe/src/main/java/com/cloudera/impala/analysis/Expr.java M fe/src/main/java/com/cloudera/impala/analysis/LiteralExpr.java M fe/src/main/java/com/cloudera/impala/planner/KuduScanNode.java M fe/src/test/java/com/cloudera/impala/planner/PlannerTestBase.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-selectivity.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu-update.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-planner/queries/PlannerTest/tpch-kudu.test 17 files changed, 337 insertions(+), 993 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/4120/1 -- To view, visit http://gerrit.cloudera.org:8080/4120 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Matthew Jacobs
