I have a table (call it fact_table) that I want to create in Kudu. I have an equivalent table in Impala/Parquet that is partitioned by day (print_date_id):

    create table impala_fact_table (
      company_id INT,
      transcount INT
    )
    partitioned by (print_date_id INT)
    STORED AS PARQUET;

A common query would be:

    select sum(transcount)
    from impala_fact_table f
    join company_dim c on f.company_id = c.company_id
    where c.company_id in (123, 456)
      and f.print_date_id between 20170101 and 20170202;

I created an equivalent of the fact table in Kudu:

    CREATE TABLE kudu_fact_table (
      id STRING,
      print_date_id INT,
      company_id INT,
      transcount INT,
      PRIMARY KEY (id, print_date_id)
    )
    PARTITION BY HASH PARTITIONS 16
    STORED AS KUDU
    TBLPROPERTIES (
      'kudu.table_name' = 'kudu_fact_table',
      'kudu.master_addresses' = 'myserver:7051'
    );

But the performance of the join with this Kudu table is terrible: 2 seconds with the Impala table versus 126 seconds with the Kudu table.

    select sum(transcount)
    from kudu_fact_table f
    join company_dim c on f.company_id = c.company_id
    where c.company_id in (123, 456)
      and f.print_date_id between 20170101 and 20170202;

How should I design my Kudu table so that performance is somewhat comparable?
