I'd add that moving the print_date_id to the beginning of the primary key in the Kudu fact table would allow each server to do a range scan instead of a full scan.
-Todd On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <[email protected]> wrote: > Hi Tenny, > > First off, how many tablet servers are in your cluster? 16 partitions is > appropriate for one or maybe two tablet servers, so if your cluster is > bigger you could try bumping the number of partitions. > > Second, the schemas don't look identical, you have an additional 'id' > column in the Kudu table, and crucially, it doesn't have any predicates, so > this query is doing a full table scan. > > Finally, the Parquet table is likely able to take advantage of significant > partition pruning due to the between clause. An equivalent in Kudu would > be range partitioning on the print_date_id. You might try doing the same > for Kudu. > > - Dan > > On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <[email protected]> > wrote: > >> I have a table (call this fact_table) that I want to create in kudu. >> >> I have an equivalent table in impala/parquet that is partitioned by >> day_id. >> >> create table impala_fact_table ( >> company_id INT, >> transcount INT) >> partitioned by >> (print_date_id INT) >> STORED AS PARQUET; >> >> so a common query would be: >> >> select sum(transcount) >> from impala_fact_table f >> join with company_dim c on f.company_id = c.company_id >> where c.company_id in (123,456) >> and f.print_date_id between 20170101 and 20170202 >> >> I created an equivalent of the fact table in kudu: >> >> CREATE TABLE kudu_fact_table ( >> id STRING, >> print_date_id, >> company_id INT, >> transcount INT) >> PRIMARY KEY(id,print_date_id) >> ) PARTITION BY HASH PARTITIONS 16 >> ) >> STORED AS KUDU >> TBLPROPERTIES( >> 'kudu.table_name' = 'kudu_fact_table', >> 'kudu.master_addresses' = 'myserver:7051' >> ); >> >> But the performance of the join with this kudu table is terrible, 2 secs >> with impala table vs 126 secs with kudu table. >> >> select sum(transcount) >> from kudu_fact_table f >> join with company_dim c on f.company_id = c.company_id >> where c.company_id in (123,456) >> and f.print_date_id between 20170101 and 20170202 >> >> How should I design my kudu table so performance is somewhat comparable? >> >> > -- Todd Lipcon Software Engineer, Cloudera
