Hi there,
I have an interesting performance issue reading from Kudu. Hopefully there is a
good explanation for it because the difference in the performance is quite
significant and it puzzles me a lot.
Basically we have a table with the following schema:
Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
.... (a bunch of int32 and int16 columns)
PK is (Column1, Column2)
HASH(Column1) PARTITIONS 4
The number of records is ~60M. ~5K distinct Column1 values. ~1.4M distinct
values for Column2.
All tests are made on one core. I think the hardware specs are not important.
1) If we query all data using
val scanner =
kuduClient.getAsyncScannerBuilder(table)
.addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
ComparisonOp.EQUAL, column1Value)).build()
We use 3 scanners in parallel (one query for each unique value of column1).
All fields from the returned rows are read and some internal structures are
built.
In this case, it takes ~40 sec to load all the data.
2) If we query using "InListPredicate", then the performance is super slow.
val scanner =
kuduClient.getAsyncScannerBuilder(table)
.addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
ComparisonOp.EQUAL, column1Value))
.addPredicate(KuduPredicate.newInListPredicate(Column2Schema,
column2Values.asJava)).build()
Same as in 1), 3 scanners in parallel, all records are read and some in-memory
structures are built. This time column2 values are split into a bunch of chunks
and we send a request for each unique value of column1 and each chunk of
column2 values.
a) If we chunk column2Values into buckets of 50K ids, it takes ~2900 sec
to load everything.
b) If we put all 1.4M records in one request (column2Values), it's even
slower than a)
Can you please explain what InList is doing that might change the performance
so dramatically. As you see in 2b, the number of queries is the same as in 1.
That's really a lot of records, but given these experiments, it seems better to
read by column1 only and then make an in-memory filter by column2. I have
always imagined, that the Kudu driver or server can do that much more
efficiently.
Thanks in advance.
Regards,
Todor