I have 24 tablet servers. I added an id column because Kudu requires a primary key to be specified, and I needed a unique column to serve as one. My original table actually has 20 columns and no single column is unique on its own, so I concatenated 5 of them to build a unique id column and made it part of the primary key. I tried specifying all 5 columns as the primary key, but the inserts were much slower, so I switched to just 2 columns as the primary key, which seems to improve insert speed.
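In case it helps, the load step that builds the concatenated id looks roughly like this. It is just a sketch: col_a through col_e stand in for the five source columns I concatenate, and src_table for the staging table I load from.

INSERT INTO kudu_fact_table (print_date_id, id, company_id, transcount)
SELECT
  print_date_id,
  -- build the surrogate id by concatenating the five source columns
  concat_ws('|',
            cast(col_a AS STRING),
            cast(col_b AS STRING),
            cast(col_c AS STRING),
            cast(col_d AS STRING),
            cast(col_e AS STRING)) AS id,
  company_id,
  transcount
FROM src_table;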
So this is my new schema, and I will measure query speed with it. If I partition by day, is 1000 partitions too many? What is the recommended maximum number of partitions Kudu can handle?

CREATE TABLE kudu_fact_table (
  print_date_id INT,
  id STRING,
  company_id INT,
  transcount INT,
  PRIMARY KEY(print_date_id, id)
)
PARTITION BY RANGE (print_date_id) (
  PARTITION VALUE = 20170101,
  PARTITION VALUE = 20170102
  ... (1 partition for each day, and I have 3 years' worth of data)
)
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);

On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <[email protected]> wrote:

> I'd add that moving the print_date_id to the beginning of the primary key
> in the Kudu fact table would allow each server to do a range scan instead
> of a full scan.
>
> -Todd
>
> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <[email protected]> wrote:
>
>> Hi Tenny,
>>
>> First off, how many tablet servers are in your cluster? 16 partitions is
>> appropriate for one or maybe two tablet servers, so if your cluster is
>> bigger you could try bumping the number of partitions.
>>
>> Second, the schemas don't look identical: you have an additional 'id'
>> column in the Kudu table, and crucially, it doesn't have any predicates,
>> so this query is doing a full table scan.
>>
>> Finally, the Parquet table is likely able to take advantage of
>> significant partition pruning due to the BETWEEN clause. An equivalent in
>> Kudu would be range partitioning on print_date_id. You might try doing
>> the same for Kudu.
>>
>> - Dan
>>
>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <[email protected]> wrote:
>>
>>> I have a table (call this fact_table) that I want to create in Kudu.
>>>
>>> I have an equivalent table in Impala/Parquet that is partitioned by
>>> print_date_id:
>>>
>>> create table impala_fact_table (
>>>   company_id INT,
>>>   transcount INT)
>>> partitioned by (print_date_id INT)
>>> STORED AS PARQUET;
>>>
>>> so a common query would be:
>>>
>>> select sum(transcount)
>>> from impala_fact_table f
>>> join company_dim c on f.company_id = c.company_id
>>> where c.company_id in (123, 456)
>>> and f.print_date_id between 20170101 and 20170202;
>>>
>>> I created an equivalent of the fact table in Kudu:
>>>
>>> CREATE TABLE kudu_fact_table (
>>>   id STRING,
>>>   print_date_id INT,
>>>   company_id INT,
>>>   transcount INT,
>>>   PRIMARY KEY(id, print_date_id)
>>> )
>>> PARTITION BY HASH PARTITIONS 16
>>> STORED AS KUDU
>>> TBLPROPERTIES(
>>>   'kudu.table_name' = 'kudu_fact_table',
>>>   'kudu.master_addresses' = 'myserver:7051'
>>> );
>>>
>>> But the performance of the join with this Kudu table is terrible: 2 secs
>>> with the Impala table vs 126 secs with the Kudu table.
>>>
>>> select sum(transcount)
>>> from kudu_fact_table f
>>> join company_dim c on f.company_id = c.company_id
>>> where c.company_id in (123, 456)
>>> and f.print_date_id between 20170101 and 20170202;
>>>
>>> How should I design my Kudu table so performance is somewhat comparable?
>>>
>>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Regards,
Tenny Susanto
