Hi Franco,

Great question, and I think this gets at a deeper use case that Kudu
could really excel at, but currently doesn't have the full set of required
features to support.  To your original question: you've pretty much covered
all of the bases.  Kudu currently doesn't have an efficient way to scan
backwards.  I agree with all of the conclusions you drew regarding manual
binary search; it's probably feasible, but obviously not scalable if you
have to do it for every write, or maybe even every batch.  If you only have
to do it every time you restart your single writer process, then maybe it's
not so bad.

As far as the bigger picture, I think this is a symptom of the fact that
Kudu doesn't currently make it easy to work with data that doesn't have a
natural primary key.  It sounds like for your use case an
auto-increment column would be ideal.  Auto-increment is tricky when
combined with sharded/partitioned tables, but I think it's solvable.  One
thing we've come up against as well is data sets where many columns need
to be shoved into the PK in order to make it unique.  There's definitely
overhead in having big multi-column string PKs in Kudu, so an
auto-increment feature would be a big win for these datasets as well.
KUDU-1945 <https://issues.apache.org/jira/browse/KUDU-1945> is more or less
our tracking ticket for this idea.

One thing I'll mention, mostly as a 'food for thought' type idea is how the
experimental kudu-ts
<https://github.com/danburkert/kudu-ts/tree/master/core> system works.
 kudu-ts supports multi-attribute indexing, and in order to do so it builds
an index table which is effectively a linear-probing open-addressed hash
map inside a Kudu table.  The indexed attributes are hashed, and then
inserted into the 'tagsets' table with the hash value as the PK.  If
there's a collision on insert, it does linear probing.  Even though this is
slow on writes, it works out really well for Kudu's read patterns: you just
do a scan with a limit of 10 or 20, and it's virtually guaranteed that the
row you're looking for will be in that result set.  Now, I only bring this
up because it's an interesting thought experiment; we haven't done any
scale tests, ymmv, etc etc.
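
In case it helps to picture it, here's a very rough sketch of that read
path -- not kudu-ts's actual code, just the pattern.  The 'tagsets' table
is assumed to have an INT64 'id' primary key and a STRING 'tagset' value
column, and the hash is a stand-in:

    // Rough sketch of the probing read path described above; not
    // kudu-ts's actual code.  Imports assumed: org.apache.kudu.client.*,
    // java.util.OptionalLong.
    static OptionalLong findTagset(KuduClient client, KuduTable tagsets,
                                   String tagset) throws KuduException {
      // Hash the indexed attributes to pick the starting slot.
      long hash = tagset.hashCode() & 0x7fffffffL;
      PartialRow lower = tagsets.getSchema().newPartialRow();
      lower.addLong("id", hash);
      // One scan of a small window starting at the hash; with linear
      // probing the entry is almost always within the first few slots.
      KuduScanner scanner = client.newScannerBuilder(tagsets)
          .lowerBound(lower)
          .limit(20)
          .build();
      try {
        while (scanner.hasMoreRows()) {
          for (RowResult row : scanner.nextRows()) {
            if (row.getString("tagset").equals(tagset)) {
              return OptionalLong.of(row.getLong("id")); // found it
            }
          }
        }
      } finally {
        scanner.close();
      }
      return OptionalLong.empty(); // absent (insert would probe forward)
    }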

Sorry I don't have a great solution for your use case, but hopefully with
some additional features Kudu will eventually get there.  I'd be very
interested to hear your results if you pursue the binary search solution.

- Dan

On Wed, Dec 13, 2017 at 6:54 PM, Franco Venturi <fvent...@comcast.net>
wrote:

> Scenario: Kudu table with a primary key (PK) made of just one column of
> type INT64 (i.e. Java long)
>
>
> Problem: Compute max(primary key) in an efficient way
>
>
>
> Discussion:
> At work we have a Kudu table with billions of rows; its primary key is
> made of just one column of type INT64, this column contains a sequence
> number increasing from 1 to N (number of rows).
> For this discussion I am going to assume that the design of this table is
> fixed and can't be changed.
>
>
> I'd like to find an efficient way to compute the maximum of the primary
> key, so I know which sequence number I can start from when inserting
> another batch of rows in this table.
>
>
> Notice that the dual problem, i.e. computing min(PK), can be solved
> trivially in O(1) time, due to the fact that in Kudu all the rows within a
> tablet are kept in primary key sorted order (see here:
> https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_schema_design.html#concept_f4d_jsy_1z).
> A simple way
> to do that is to get a list of KuduScanTokens from a KuduScanTokenBuilder
> (with setProjectedColumnIndexes set to return just the 0th column, and
> limit set to 1), read the first row returned on each tablet in the list,
> and compute the minimum of these values.
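>
> For reference, here's a minimal sketch of that min(PK) computation (the
> master address and table name are placeholders; imports assumed:
> org.apache.kudu.client.*, java.util.Arrays, java.util.List):
>
>     KuduClient client =
>         new KuduClient.KuduClientBuilder("kudu-master:7051").build();
>     KuduTable table = client.openTable("my_table");
>     // One scan token per tablet, projecting only the PK column and
>     // returning at most one row each.
>     List<KuduScanToken> tokens = client.newScanTokenBuilder(table)
>         .setProjectedColumnIndexes(Arrays.asList(0))
>         .limit(1)
>         .build();
>     long min = Long.MAX_VALUE;
>     for (KuduScanToken token : tokens) {
>       KuduScanner scanner = token.intoScanner(client);
>       while (scanner.hasMoreRows()) {
>         for (RowResult row : scanner.nextRows()) {
>           // Rows in a tablet are PK-sorted, so this is the tablet's min.
>           min = Math.min(min, row.getLong(0));
>         }
>       }
>       scanner.close();
>     }
>     client.close();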
>
>
> In the case of finding the maximum value instead, the simplest approach
> would be to run a full table scan (using a KuduScanner or a set of
> KuduScanTokens in parallel, with setProjectedColumnIndexes set to return
> just the primary key column), and find the maximum among all the values
> (or the maximum among the last values returned by each tablet). This
> approach, however, scales as O(N) and therefore takes a while to run when
> the table has several billion rows.
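>
> A sketch of that baseline, reusing the client and table handles from the
> snippet above:
>
>     // O(N) approach: full scan projecting only the PK column, keeping
>     // a running maximum.
>     KuduScanner scanner = client.newScannerBuilder(table)
>         .setProjectedColumnIndexes(Arrays.asList(0))
>         .build();
>     long max = Long.MIN_VALUE;
>     while (scanner.hasMoreRows()) {
>       for (RowResult row : scanner.nextRows()) {
>         max = Math.max(max, row.getLong(0));
>       }
>     }
>     scanner.close();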
>
>
> I also read the API documentation and the code to see if Kudu offered a
> way to scan the rows of a table backwards (i.e. in decreasing order), but I
> couldn't find it (but I would be glad to be proven wrong on this one).
>
>
> After some thinking I came up with this algorithm that uses the
> lowerBound() method in KuduScannerBuilder and bisection: given an interval
> of possible values where the primary key maximum could be, I create a
> KuduScannerBuilder with a lowerBound set to a value that is half way
> between the two ends of the interval; if that KuduScanner returns at least
> 1 row (which can be checked in O(1) time), then the maximum value must be
> somewhere between the half way point and the upper end; on the other hand
> if that KuduScanner returns no rows, then the maximum value must be in the
> lower half.
> I then repeat the same bisection process on the half interval selected
> above, and so on, following the standard bisection algorithm. As you can
> easily see, this algorithm takes about O(log2(N)) probes.
> I also added a couple of additional tricks to my code (a simplified
> sketch of the whole approach follows this list):
>       - since it is very unlikely that my maximum is in the range of the
> trillions or quadrillions (since it is just a sequential number, i.e. it is
> the same as the number of rows I have), I run a first
> lowerBound()+bisection loop to determine the highest '1' bit in the maximum
> (i.e. I start with 1<<32, see if there's any row above that lower bound; if
> there are none, I try again with 1<<16, and so on)
>       - since I imagine that creating KuduScanners (and closing them
> afterwards) is an "expensive" operation compared to just scanning a few
> rows, when the bisection interval reaches some predefined value (in my
> tests 128 rows), I switch to just a regular scan of this final interval and
> find the maximum among all the values found in this small interval in the
> same way as the standard approach described above.
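>
> Here's that simplified sketch (it assumes the keys start at 1, and uses
> a plain doubling loop in place of the highest-'1'-bit probe described
> above -- same O(log N) number of probes; imports assumed:
> org.apache.kudu.client.*, java.util.Arrays):
>
>     // Does at least one row have PK >= v?  One cheap probe scan.
>     static boolean anyRowAtOrAbove(KuduClient client, KuduTable table,
>                                    long v) throws KuduException {
>       PartialRow lower = table.getSchema().newPartialRow();
>       lower.addLong(0, v);
>       KuduScanner scanner = client.newScannerBuilder(table)
>           .setProjectedColumnIndexes(Arrays.asList(0))
>           .lowerBound(lower)
>           .limit(1)
>           .build();
>       boolean found = false;
>       while (!found && scanner.hasMoreRows()) {
>         found = scanner.nextRows().hasNext();
>       }
>       scanner.close();
>       return found;
>     }
>
>     static long maxPk(KuduClient client, KuduTable table)
>         throws KuduException {
>       // Find an upper bound by doubling (stand-in for the bit probe).
>       long hi = 1;
>       while (anyRowAtOrAbove(client, table, hi)) {
>         hi <<= 1;
>       }
>       // Invariant: some row has PK >= lo, and no row has PK >= hi.
>       long lo = hi >> 1;
>       while (hi - lo > 128) { // below ~128 rows, one scan beats probing
>         long mid = lo + (hi - lo) / 2;
>         if (anyRowAtOrAbove(client, table, mid)) {
>           lo = mid;
>         } else {
>           hi = mid;
>         }
>       }
>       // Finish with a regular scan of the small remaining interval.
>       PartialRow lower = table.getSchema().newPartialRow();
>       lower.addLong(0, lo);
>       KuduScanner scanner = client.newScannerBuilder(table)
>           .setProjectedColumnIndexes(Arrays.asList(0))
>           .lowerBound(lower)
>           .build();
>       long max = lo;
>       while (scanner.hasMoreRows()) {
>         for (RowResult row : scanner.nextRows()) {
>           max = Math.max(max, row.getLong(0));
>         }
>       }
>       scanner.close();
>       return max;
>     }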
>
>
> In conclusion, the lowerBound()+bisection algorithm is definitely efficient
> (and a few tests I ran showed that), but it seems very complicated (perhaps
> more than it should be), so I was wondering: am I missing something
> obvious, and if any of you had to solve a similar problem in the past, how
> did you do it?
>
>
> Also I haven't looked at the source code for Impala, but I would like to
> know if Impala uses any trick (or undocumented rpc call/parameter) when it
> comes to computing something like this or scanning the rows of a tablet
> backwards.
>
>
> Franco
>
