Hi Tenny,

I see in the plan that the Kudu tables are missing statistics, so a first step would be to compute statistics and check the result. Also, from the plan I see this particular query is only scanning from 3 Kudu hosts, vs 24 in Parquet. If queries are bound to scan a small range, maybe consider adding hash partitioning with coarser-grained range partitions.
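For concreteness, the two suggestions might look something like the sketch below. This is illustrative only: the table and column names follow the schemas posted earlier in this thread, and the hash bucket count and month-wide range boundaries are assumptions you'd tune for your own cluster and data volume.

```sql
-- Gather table and column statistics so the planner can pick
-- better join orders and join strategies:
COMPUTE STATS kudu_fact_table;

-- Hash partitioning layered on coarser (here, month-wide) range
-- partitions spreads each scanned date range across more tablet
-- servers instead of the 3 hosts the current plan shows:
CREATE TABLE kudu_fact_table_v2 (
  print_date_id INT,
  id STRING,
  company_id INT,
  transcount INT,
  PRIMARY KEY (print_date_id, id)
)
PARTITION BY HASH (id) PARTITIONS 8,
             RANGE (print_date_id)
(
  PARTITION 20170101 <= VALUES < 20170201,
  PARTITION 20170201 <= VALUES < 20170301
  -- ...one partition per month
)
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table_v2',
  'kudu.master_addresses' = 'myserver:7051'
);
```

With this layout a BETWEEN predicate on print_date_id still prunes to a few range partitions, but each surviving range is split across 8 hash buckets, so the scan parallelizes across more of the 24 tablet servers.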
Best,
David

Sent from my iPhone

> On Mar 10, 2017, at 3:05 PM, tenny susanto <[email protected]> wrote:
>
> Ahh..no, I did not configure any compression on the kudu table. I will go
> check the docs and enable compression.
>
> In the meantime, here are the query profiles against the Impala Parquet table
> vs my existing no-compression Kudu table.
>
>> On Fri, Mar 10, 2017 at 11:50 AM, Todd Lipcon <[email protected]> wrote:
>> Hi Tenny,
>>
>> Sorry for the delay on this thread. Just wanted to check in and find out how
>> the experiments are going.
>>
>> Do you have query profiles of the query against Parquet and the same query
>> against Kudu? The 15x difference you reported is not expected.
>>
>> I didn't see it mentioned above in the thread: did you configure any
>> encoding or compression on the Kudu tables? This is often a source of major
>> performance differences.
>>
>> -Todd
>>
>>> On Fri, Feb 24, 2017 at 2:11 PM, tenny susanto <[email protected]> wrote:
>>> On my impala parquet table, each day partition is about 500MB - 1GB.
>>>
>>> So using range partitioning by day, query time went down to 35 sec from 123 sec.
>>>
>>> The query against the Impala table is 2 seconds.
>>>
>>>> On Fri, Feb 24, 2017 at 1:34 PM, Dan Burkert <[email protected]> wrote:
>>>> Hi Tenny,
>>>>
>>>> 1000 partitions is on the upper end of what I'd recommend - with 3x
>>>> replication that's 125 tablet replicas per tablet server (something more
>>>> like 20 or 30 would be ideal, depending on hardware). How much data does
>>>> each day have? I would aim for a tablet size on the order of 50 GiB, so if
>>>> it's not that much per day you could try making week- or month-wide
>>>> partitions. Just bumping the number of partitions and being able to take
>>>> advantage of partition pruning should improve the performance tremendously.
>>>>
>>>> In the next release we're adding support for pushing down IN list predicates,
>>>> which could help your query even more if you put company_id as the
>>>> first component of your primary key. That being said, I think improved
>>>> range partitioning will likely give the most dramatic improvement, and
>>>> there's no need to wait.
>>>>
>>>> Week-wide range partitions can be specified like:
>>>>
>>>> PARTITION 20170101 <= VALUES < 20170108,
>>>> PARTITION 20170108 <= VALUES < 20170115,
>>>> ...
>>>>
>>>> - Dan
>>>>
>>>>> On Fri, Feb 24, 2017 at 1:15 PM, tenny susanto <[email protected]> wrote:
>>>>> I have 24 tablet servers.
>>>>>
>>>>> I added an id column because I needed a unique column for the primary
>>>>> key, since Kudu requires a primary key to be specified. My original table
>>>>> actually has 20 columns with no single primary key column. I concatenated
>>>>> 5 of them to build a unique id column, which I made part of the
>>>>> primary key. I tried specifying 5 columns as the primary key, but
>>>>> I noticed the inserts were much slower, so I tried with just 2 columns as
>>>>> the primary key instead, which seems to improve insert speed.
>>>>>
>>>>> So this is my new schema, and I will measure query speed with it. If I
>>>>> partition by day, is 1000 partitions too many? What is the recommended
>>>>> maximum number of partitions Kudu can handle?
>>>>>
>>>>> CREATE TABLE kudu_fact_table (
>>>>>   print_date_id INT,
>>>>>   id STRING,
>>>>>   company_id INT,
>>>>>   transcount INT,
>>>>>   PRIMARY KEY(print_date_id, id)
>>>>> )
>>>>> PARTITION BY RANGE (print_date_id)
>>>>> (
>>>>>   PARTITION VALUE = 20170101,
>>>>>   PARTITION VALUE = 20170102,
>>>>>   ...  -- 1 partition for each day; I have 3 years' worth of data
>>>>> )
>>>>> STORED AS KUDU
>>>>> TBLPROPERTIES(
>>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>>> );
>>>>>
>>>>>> On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <[email protected]> wrote:
>>>>>> I'd add that moving print_date_id to the beginning of the primary
>>>>>> key in the Kudu fact table would allow each server to do a range scan
>>>>>> instead of a full scan.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>>> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <[email protected]> wrote:
>>>>>>> Hi Tenny,
>>>>>>>
>>>>>>> First off, how many tablet servers are in your cluster? 16 partitions
>>>>>>> is appropriate for one or maybe two tablet servers, so if your cluster
>>>>>>> is bigger you could try bumping the number of partitions.
>>>>>>>
>>>>>>> Second, the schemas don't look identical: you have an additional 'id'
>>>>>>> column in the Kudu table, and crucially, it doesn't have any
>>>>>>> predicates, so this query is doing a full table scan.
>>>>>>>
>>>>>>> Finally, the Parquet table is likely able to take advantage of
>>>>>>> significant partition pruning due to the BETWEEN clause. The equivalent
>>>>>>> in Kudu would be range partitioning on print_date_id. You might
>>>>>>> try doing the same for Kudu.
>>>>>>>
>>>>>>> - Dan
>>>>>>>
>>>>>>>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <[email protected]> wrote:
>>>>>>>> I have a table (call this fact_table) that I want to create in kudu.
>>>>>>>>
>>>>>>>> I have an equivalent table in impala/parquet that is partitioned by
>>>>>>>> day_id.
>>>>>>>>
>>>>>>>> create table impala_fact_table (
>>>>>>>>   company_id INT,
>>>>>>>>   transcount INT)
>>>>>>>> partitioned by (print_date_id INT)
>>>>>>>> STORED AS PARQUET;
>>>>>>>>
>>>>>>>> so a common query would be:
>>>>>>>>
>>>>>>>> select sum(transcount)
>>>>>>>> from impala_fact_table f
>>>>>>>> join company_dim c on f.company_id = c.company_id
>>>>>>>> where c.company_id in (123, 456)
>>>>>>>> and f.print_date_id between 20170101 and 20170202;
>>>>>>>>
>>>>>>>> I created an equivalent of the fact table in kudu:
>>>>>>>>
>>>>>>>> CREATE TABLE kudu_fact_table (
>>>>>>>>   id STRING,
>>>>>>>>   print_date_id INT,
>>>>>>>>   company_id INT,
>>>>>>>>   transcount INT,
>>>>>>>>   PRIMARY KEY(id, print_date_id)
>>>>>>>> )
>>>>>>>> PARTITION BY HASH PARTITIONS 16
>>>>>>>> STORED AS KUDU
>>>>>>>> TBLPROPERTIES(
>>>>>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>>>>>> );
>>>>>>>>
>>>>>>>> But the performance of the join with this kudu table is terrible: 2
>>>>>>>> secs with the impala table vs 126 secs with the kudu table.
>>>>>>>>
>>>>>>>> select sum(transcount)
>>>>>>>> from kudu_fact_table f
>>>>>>>> join company_dim c on f.company_id = c.company_id
>>>>>>>> where c.company_id in (123, 456)
>>>>>>>> and f.print_date_id between 20170101 and 20170202;
>>>>>>>>
>>>>>>>> How should I design my kudu table so performance is somewhat
>>>>>>>> comparable?
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> Tenny Susanto
>>>
>>> --
>>> Regards,
>>>
>>> Tenny Susanto
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
> --
> Regards,
>
> Tenny Susanto
>
> <benchmark_impala_parquet.txt>
> <benchmark_kudu_range_partition.txt>
