On Fri, May 11, 2018 at 12:05 PM, Todor Petrov <[email protected]> wrote:
> Hi there, > > > > I have an interesting performance issue reading from Kudu. Hopefully there > is a good explanation for it because the difference in the performance is > quite significant and it puzzles me a lot. > > > > Basically we have a table with the following schema: > > > > *Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION* > > *Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION* > > *…. (a bunch of int32 and int16 columns)* > > > > *PK is (Column1, Column2)* > > *HASH(Column1) PARTITIONS 4* > > > > The number of records is *~60M*. *~5K* distinct Column1 values. *~1.4M* > distinct values for Column2. > > > > All tests are made on one core. I think the hardware specs are not > important. > > > > 1) If we query all data using > > > > * val scanner = * > > * kuduClient.getAsyncScannerBuilder(table)* > > * > .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, > ComparisonOp.EQUAL, column1Value)).build()* > > > > We use 3 scanners in parallel (one query for each unique value of column1). > > > > All fields from the returned rows are read and some internal structures > are built. > > > > In this case, it takes *~40 sec* to load all the data. > > > > 2) If we query using “InListPredicate”, then the performance is > super slow. > > > > * val scanner = * > > * kuduClient.getAsyncScannerBuilder(table)* > > * > .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, > ComparisonOp.EQUAL, column1Value))* > > * .addPredicate(KuduPredicate.newInListPredicate(Column2Schema, > column2Values.asJava)).build()* > > > > Same as in 1), 3 scanners in parallel, all records are read and some > in-memory structures are built. This time column2 values are split into a > bunch of chunks and we send a request for each unique value of column1 and > each chunk of column2 values. > Are you sorting the values of 'column2' before doing the chunking? Kudu doesn't use indexes for evaluating IN-list predicates except for using the min(in-list-values) and max(in-list-values). So, if you had for example: pre-chunk in-list: 1,2,3,4,5,6 chunk 1: col2 IN (1,6) chunk 2: col2 IN (2,5) chunk 3: col2 IN (3,4) then you will actually scan over the middle portion of that table 3 times. If you sort the in-list before chunking you'll avoid the multiple-scan effect here. -Todd -- Todd Lipcon Software Engineer, Cloudera
