Hi there,

I have an interesting performance issue reading from Kudu. Hopefully there is a 
good explanation for it because the difference in the performance is quite 
significant and it puzzles me a lot.

Basically we have a table with the following schema:

Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
.... (a bunch of int32 and int16 columns)

PK is (Column1, Column2)
HASH(Column1) PARTITIONS 4

The number of records is ~60M. ~5K distinct Column1 values. ~1.4M distinct 
values for Column2.

All tests are made on one core. I think the hardware specs are not important.


1)      If we query all data using



      val scanner =

        kuduClient.getAsyncScannerBuilder(table)

          .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, 
ComparisonOp.EQUAL, column1Value)).build()



We use 3 scanners in parallel (one query for each unique value of column1).



All fields from the returned rows are read and some internal structures are 
built.



In this case, it takes ~40 sec to load all the data.



2)      If we query using "InListPredicate", then the performance is super slow.



      val scanner =

        kuduClient.getAsyncScannerBuilder(table)

          .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, 
ComparisonOp.EQUAL, column1Value))

          .addPredicate(KuduPredicate.newInListPredicate(Column2Schema, 
column2Values.asJava)).build()



Same as in 1), 3 scanners in parallel, all records are read and some in-memory 
structures are built. This time column2 values are split into a bunch of chunks 
and we send a request for each unique value of column1 and each chunk of 
column2 values.



a)      If we chunk column2Values into buckets of 50K ids, it takes ~2900 sec 
to load everything.

b)      If we put all 1.4M records in one request (column2Values), it's even 
slower than a)

Can you please explain what InList is doing that might change the performance 
so dramatically. As you see in 2b, the number of queries is the same as in 1. 
That's really a lot of records, but given these experiments, it seems better to 
read by column1 only and then make an in-memory filter by column2. I have 
always imagined, that the Kudu driver or server can do that much more 
efficiently.

Thanks in advance.

Regards,
Todor

Reply via email to