[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 ) Change subject: WIP KUDU-16 pt 3: add per-scan-token limits .. Patch Set 1: Revisiting this. I was a bit surprised to see the limit used in the Java client, but not in the C++ client. Given the limit is already defined and used on the scan token it seams like plumbing it through cant hurt and would make the client implementations more consistent. -- To view, visit http://gerrit.cloudera.org:8080/9923 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a Gerrit-Change-Number: 9923 Gerrit-PatchSet: 1 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Grant Henke Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 22 Jun 2020 15:23:51 + Gerrit-HasComments: No
[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits
Andrew Wong has abandoned this change. ( http://gerrit.cloudera.org:8080/9923 ) Change subject: WIP KUDU-16 pt 3: add per-scan-token limits .. Abandoned Deemed not that useful -- To view, visit http://gerrit.cloudera.org:8080/9923 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a Gerrit-Change-Number: 9923 Gerrit-PatchSet: 1 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 ) Change subject: WIP KUDU-16 pt 3: add per-scan-token limits .. Patch Set 1: > Patch Set 1: > > (1 comment) After some discussion with Thomas from the Impala team, I think the consensus is that this API isn't that useful. Right now, Impala can't really take advantage of this top-level LimitPerToken API because the scan limit isn't known at the time of token building (although there are likely ways around this through updating Impala's source). The ideal API would be one where a limit could be set per grouop of tokens and the tokens would be able to coordinate with each other to satisfy the exact limit. This is difficult to do, so it's probably off the table for now. That said, the limit is definitely known during the re-hydrating call DeserializeIntoScanner(), so Impala should be able to use the API defined in [725c6ce6391b42331e4ab4210ae4117779cba4a2](https://github.com/apache/kudu/commit/725c6ce6391b42331e4ab4210ae4117779cba4a2) in a similar way to that in this patch, i.e. specifying the overall limit per scanner such that l * t rows are returned across all scanners and returning the first l, since this is still a perf win. -- To view, visit http://gerrit.cloudera.org:8080/9923 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a Gerrit-Change-Number: 9923 Gerrit-PatchSet: 1 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 09 Apr 2018 21:43:58 + Gerrit-HasComments: No
[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 ) Change subject: WIP KUDU-16 pt 3: add per-scan-token limits .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG@19 PS1, Line 19: sure this is actually a useful API or the one that we want to expose. > We chatted about this on Slack yesterday, but my thoughts are that we shoul I agree with both of your points; if a token is a logical concept that represents some optimization on a query plan, limiting probably doesn't make much sense (unless we're doing something fancy like sampling and joining, but AFAIK we're [Impala/Spark] not). That said, as an optimization for how Impala or some other parallel-scanning service might handles limits, it still might be a convenient API to have. I took a look at Spark and it seems like limiting is handled at a different layer of abstraction than the one at which KuduRDD sits (something like https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/sql/execution/Limit.html). I'll have to dig a bit more, still not sure how KuduRDD fits in. As for Impala, yeah it'd be helpful to get some feedback from the Impala devs. I pinged Thomas since he worked on runtime filters and runtime limits seems in a similar ballpark. In the meantime, I've filed IMPALA-6821 and pointed it at this CR. -- To view, visit http://gerrit.cloudera.org:8080/9923 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a Gerrit-Change-Number: 9923 Gerrit-PatchSet: 1 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Sat, 07 Apr 2018 06:58:30 + Gerrit-HasComments: Yes
[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 ) Change subject: WIP KUDU-16 pt 3: add per-scan-token limits .. Patch Set 1: (1 comment) Code LGTM, but I could go either way on whether to include this at all. http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG@19 PS1, Line 19: sure this is actually a useful API or the one that we want to expose. > yea, Im not sure this API provides much value beyond the caller just specif We chatted about this on Slack yesterday, but my thoughts are that we should definitely support setting a limit after re-hydrating the token, but providing this API as well might be useful, and it seems like it will be low maintenance overhead. I checked out the Spark integration to see if we could take advantage of it their, but it doesn't look like the RelationProvider interface exposes any kind of Limit hooks, so I guess I don't have any concrete examples where it will be useful. -- To view, visit http://gerrit.cloudera.org:8080/9923 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a Gerrit-Change-Number: 9923 Gerrit-PatchSet: 1 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Thu, 05 Apr 2018 18:09:53 + Gerrit-HasComments: Yes
[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9923 ) Change subject: WIP KUDU-16 pt 3: add per-scan-token limits .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9923/1//COMMIT_MSG@19 PS1, Line 19: sure this is actually a useful API or the one that we want to expose. yea, Im not sure this API provides much value beyond the caller just specifying the limit when they hydrate the token before they call Open() on the scan, which has the same net effect. In general I feel like the APIs we provide for scan tokens should be things that offer some optimization opportunity at that "planning" stage. For example, it's beneficial to express predicates because it may reduce the number of generated tokens based on partition pruning, but for something like limit, especially given the odd semantics that you described, it's less obvious. A further reason it might be better for the user to handle the limits is that, at least in the case of Impala, a single backend is responsible for processing several tokens (in parallel, even), and so they'll still need to do their own limit tracking there to early-stop scanning after the limit is exhausted. Maybe worth chatting with the Impala folks to see whether they think they would prefer to use an API like this to attach the limit to the scan tokens or whether or would be more useful for them to manage the limit themselves? -- To view, visit http://gerrit.cloudera.org:8080/9923 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a Gerrit-Change-Number: 9923 Gerrit-PatchSet: 1 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 04 Apr 2018 19:23:10 + Gerrit-HasComments: Yes
[kudu-CR] WIP KUDU-16 pt 3: add per-scan-token limits
Andrew Wong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9923 Change subject: WIP KUDU-16 pt 3: add per-scan-token limits .. WIP KUDU-16 pt 3: add per-scan-token limits This patch implements scan-limiting on a per-token basis. Kudu only exposes a builder API for batches of tokens. These tokens get sent out and instantiate scanners with any specified limit. This means that if the builder limit is l, with t tokens, the number of rows returned will be at most l * t. This is preferable to splitting the limit across tokens, e.g. by limiting each token to l / t rows, as this may yield fewer than l rows across tokens, when more than l rows actually exist, which doesn't seem desirable. WIP Maybe we should expose setting individual token limits. Also I'm not sure this is actually a useful API or the one that we want to expose. Change-Id: I655f07f10a9a99e9766402729361de97e929136a --- M src/kudu/client/client.cc M src/kudu/client/client.h M src/kudu/client/scan_token-internal.cc M src/kudu/client/scan_token-test.cc 4 files changed, 37 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/9923/1 -- To view, visit http://gerrit.cloudera.org:8080/9923 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I655f07f10a9a99e9766402729361de97e929136a Gerrit-Change-Number: 9923 Gerrit-PatchSet: 1 Gerrit-Owner: Andrew Wong