Hi Tenny,

First off, how many tablet servers are in your cluster? 16 partitions is appropriate for one or maybe two tablet servers, so if your cluster is bigger you could try bumping up the number of partitions.
Second, the schemas don't look identical: you have an additional 'id' column in the Kudu table, and crucially, there are no predicates on it, so this query is doing a full table scan. Finally, the Parquet table is likely able to take advantage of significant partition pruning thanks to the BETWEEN clause. The Kudu equivalent would be range partitioning on print_date_id; you might try doing the same.

- Dan

On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <[email protected]> wrote:

> I have a table (call this fact_table) that I want to create in Kudu.
>
> I have an equivalent table in Impala/Parquet that is partitioned by
> print_date_id:
>
> create table impala_fact_table (
>   company_id INT,
>   transcount INT)
> partitioned by (print_date_id INT)
> STORED AS PARQUET;
>
> So a common query would be:
>
> select sum(transcount)
> from impala_fact_table f
> join company_dim c on f.company_id = c.company_id
> where c.company_id in (123, 456)
>   and f.print_date_id between 20170101 and 20170202;
>
> I created an equivalent of the fact table in Kudu:
>
> CREATE TABLE kudu_fact_table (
>   id STRING,
>   print_date_id INT,
>   company_id INT,
>   transcount INT,
>   PRIMARY KEY(id, print_date_id)
> ) PARTITION BY HASH PARTITIONS 16
> STORED AS KUDU
> TBLPROPERTIES(
>   'kudu.table_name' = 'kudu_fact_table',
>   'kudu.master_addresses' = 'myserver:7051'
> );
>
> But the performance of the join with this Kudu table is terrible: 2 secs
> with the Impala table vs 126 secs with the Kudu table.
>
> select sum(transcount)
> from kudu_fact_table f
> join company_dim c on f.company_id = c.company_id
> where c.company_id in (123, 456)
>   and f.print_date_id between 20170101 and 20170202;
>
> How should I design my Kudu table so performance is somewhat comparable?
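To make the range-partitioning suggestion concrete, a combined hash-plus-range layout might look like the sketch below. Everything here is a hypothetical illustration, not tested against your cluster: the split points, the hash bucket count, and hashing on company_id are all assumptions to tune against your data volume and number of tablet servers.

```sql
-- Sketch only: range partition on print_date_id so BETWEEN predicates
-- can prune tablets, plus hash partitioning to spread writes.
-- Split points and bucket count are hypothetical.
CREATE TABLE kudu_fact_table (
  print_date_id INT,
  company_id INT,
  id STRING,                -- kept as a uniquifier for the primary key
  transcount INT,
  PRIMARY KEY (print_date_id, company_id, id)
)
PARTITION BY HASH (company_id) PARTITIONS 4,
RANGE (print_date_id) (
  PARTITION VALUES < 20170101,
  PARTITION 20170101 <= VALUES < 20170201,
  PARTITION 20170201 <= VALUES < 20170301,
  PARTITION 20170301 <= VALUES
)
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);
```

With a layout like this, a predicate such as `f.print_date_id between 20170101 and 20170202` only has to touch the tablets whose ranges overlap that interval, instead of scanning the whole table.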
