Re: Kudu read - performance issue

Todd Lipcon Fri, 11 May 2018 12:49:22 -0700

On Fri, May 11, 2018 at 12:05 PM, Todor Petrov <[email protected]> wrote:


> Hi there,
>
>
>
> I have an interesting performance issue reading from Kudu. Hopefully there
> is a good explanation for it because the difference in the performance is
> quite significant and it puzzles me a lot.
>
>
>
> Basically we have a table with the following schema:
>
>
>
> *Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION*
>
> *Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION*
>
> *…. (a bunch of int32 and int16 columns)*
>
>
>
> *PK is (Column1, Column2)*
>
> *HASH(Column1) PARTITIONS 4*
>
>
>
> The number of records is *~60M*. *~5K* distinct Column1 values. *~1.4M*
> distinct values for Column2.
>
>
>
> All tests are made on one core. I think the hardware specs are not
> important.
>
>
>
> 1)      If we query all data using
>
>
>
> *      val scanner = *
>
> *        kuduClient.getAsyncScannerBuilder(table)*
>
> *
> .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
> ComparisonOp.EQUAL, column1Value)).build()*
>
>
>
> We use 3 scanners in parallel (one query for each unique value of column1).
>
>
>
> All fields from the returned rows are read and some internal structures
> are built.
>
>
>
> In this case, it takes *~40 sec* to load all the data.
>
>
>
> 2)      If we query using “InListPredicate”, then the performance is
> super slow.
>
>
>
> *      val scanner = *
>
> *        kuduClient.getAsyncScannerBuilder(table)*
>
> *
> .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema,
> ComparisonOp.EQUAL, column1Value))*
>
> *          .addPredicate(KuduPredicate.newInListPredicate(Column2Schema,
> column2Values.asJava)).build()*
>
>
>
> Same as in 1), 3 scanners in parallel, all records are read and some
> in-memory structures are built. This time column2 values are split into a
> bunch of chunks and we send a request for each unique value of column1 and
> each chunk of column2 values.
>

Are you sorting the values of 'column2' before doing the chunking? Kudu
doesn't use indexes for evaluating IN-list predicates except for using the
min(in-list-values) and max(in-list-values). So, if you had for example:

pre-chunk in-list: 1,2,3,4,5,6
chunk 1: col2 IN (1,6)
chunk 2: col2 IN (2,5)
chunk 3: col2 IN (3,4)

then you will actually scan over the middle portion of that table 3 times.

If you sort the in-list before chunking you'll avoid the multiple-scan
effect here.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Kudu read - performance issue

Reply via email to