On my Impala parquet table, each day partition is about 500 MB to 1 GB.
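(At that size, a week-wide range partition works out to roughly 3.5-7 GB and a month-wide one to roughly 15-30 GB, both well under the ~50 GiB-per-tablet target Dan mentions below, so there is room to widen the ranges further.)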
So using range partitioning by day, query time went down from 123 sec to 35 sec. The same query against the Impala table takes 2 seconds.
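For reference, a week-wide variant along the lines Dan suggested would look something like this (a minimal sketch, assuming the same columns and TBLPROPERTIES as my schema quoted below, with only the first two weekly ranges written out):

  CREATE TABLE kudu_fact_table (
    print_date_id INT,
    id STRING,
    company_id INT,
    transcount INT,
    PRIMARY KEY (print_date_id, id)
  )
  PARTITION BY RANGE (print_date_id)
  (
    -- week-wide ranges per Dan's example; one clause per week
    PARTITION 20170101 <= VALUES < 20170108,
    PARTITION 20170108 <= VALUES < 20170115
  )
  STORED AS KUDU
  TBLPROPERTIES(
    'kudu.table_name' = 'kudu_fact_table',
    'kudu.master_addresses' = 'myserver:7051'
  );

This also keeps print_date_id first in the primary key, per Todd's note, so date-bounded queries stay range scans.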
On Fri, Feb 24, 2017 at 1:34 PM, Dan Burkert <[email protected]> wrote:

> Hi Tenny,
>
> 1000 partitions is on the upper end of what I'd recommend - with 3x
> replication that's 125 tablet replicas per tablet server (something more
> like 20 or 30 would be ideal depending on hardware). How much data does
> each day have? I would aim for a tablet size on the order of 50 GiB, so if
> it's not that much per day you could try making week- or month-wide
> partitions. Just bumping the number of partitions and being able to take
> advantage of partition pruning should improve the performance tremendously.
>
> In the next release we're adding support for pushing down IN-list
> predicates, which could help your query even more if you put company_id
> as the first component of your primary key. That being said, I think
> improved range partitioning will likely give the most dramatic
> improvements, and there's no need to wait.
>
> Week-wide range partitions can be specified like:
>
>   PARTITION 20170101 <= VALUES < 20170108,
>   PARTITION 20170108 <= VALUES < 20170115,
>   ...
>
> - Dan
>
> On Fri, Feb 24, 2017 at 1:15 PM, tenny susanto <[email protected]> wrote:
>
>> I have 24 tablet servers.
>>
>> I added an id column because I needed a unique column to be the primary
>> key, as Kudu requires a primary key to be specified. My original table
>> actually has 20 columns with no single primary-key column. I concatenated
>> 5 of them to build a unique id column, which I made part of the primary
>> key. I tried specifying 5 columns as the primary key but noticed the
>> inserts were much slower, so I tried just 2 columns as the primary key
>> instead, which seems to improve insert speed.
>>
>> So this is my new schema, and I will measure query speed with it. If I
>> partition by day, is 1000 partitions too many? What is the recommended
>> maximum number of partitions Kudu can handle?
>>
>> CREATE TABLE kudu_fact_table (
>>   print_date_id INT,
>>   id STRING,
>>   company_id INT,
>>   transcount INT,
>>   PRIMARY KEY (print_date_id, id)
>> )
>> PARTITION BY RANGE (print_date_id)
>> (
>>   PARTITION VALUE = 20170101,
>>   PARTITION VALUE = 20170102
>>   ... (1 partition for each day, and I have 3 years' worth of data)
>> )
>> STORED AS KUDU
>> TBLPROPERTIES(
>>   'kudu.table_name' = 'kudu_fact_table',
>>   'kudu.master_addresses' = 'myserver:7051'
>> );
>>
>> On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <[email protected]> wrote:
>>
>>> I'd add that moving print_date_id to the beginning of the primary key
>>> in the Kudu fact table would allow each server to do a range scan
>>> instead of a full scan.
>>>
>>> -Todd
>>>
>>> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <[email protected]> wrote:
>>>
>>>> Hi Tenny,
>>>>
>>>> First off, how many tablet servers are in your cluster? 16 partitions
>>>> is appropriate for one or maybe two tablet servers, so if your cluster
>>>> is bigger you could try bumping the number of partitions.
>>>>
>>>> Second, the schemas don't look identical: you have an additional 'id'
>>>> column in the Kudu table, and crucially, it doesn't have any
>>>> predicates, so this query is doing a full table scan.
>>>>
>>>> Finally, the Parquet table is likely able to take advantage of
>>>> significant partition pruning due to the BETWEEN clause. The
>>>> equivalent in Kudu would be range partitioning on print_date_id. You
>>>> might try doing the same for Kudu.
>>>>
>>>> - Dan
>>>>
>>>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <[email protected]> wrote:
>>>>
>>>>> I have a table (call this fact_table) that I want to create in Kudu.
>>>>>
>>>>> I have an equivalent table in Impala/Parquet that is partitioned by
>>>>> day_id.
>>>>>
>>>>> create table impala_fact_table (
>>>>>   company_id INT,
>>>>>   transcount INT)
>>>>> partitioned by (print_date_id INT)
>>>>> STORED AS PARQUET;
>>>>>
>>>>> So a common query would be:
>>>>>
>>>>> select sum(transcount)
>>>>> from impala_fact_table f
>>>>> join company_dim c on f.company_id = c.company_id
>>>>> where c.company_id in (123, 456)
>>>>>   and f.print_date_id between 20170101 and 20170202;
>>>>>
>>>>> I created an equivalent of the fact table in Kudu:
>>>>>
>>>>> CREATE TABLE kudu_fact_table (
>>>>>   id STRING,
>>>>>   print_date_id INT,
>>>>>   company_id INT,
>>>>>   transcount INT,
>>>>>   PRIMARY KEY (id, print_date_id)
>>>>> )
>>>>> PARTITION BY HASH PARTITIONS 16
>>>>> STORED AS KUDU
>>>>> TBLPROPERTIES(
>>>>>   'kudu.table_name' = 'kudu_fact_table',
>>>>>   'kudu.master_addresses' = 'myserver:7051'
>>>>> );
>>>>>
>>>>> But the performance of the join with this Kudu table is terrible: 2
>>>>> secs with the Impala table vs 126 secs with the Kudu table.
>>>>>
>>>>> select sum(transcount)
>>>>> from kudu_fact_table f
>>>>> join company_dim c on f.company_id = c.company_id
>>>>> where c.company_id in (123, 456)
>>>>>   and f.print_date_id between 20170101 and 20170202;
>>>>>
>>>>> How should I design my Kudu table so performance is somewhat
>>>>> comparable?
>>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>> --
>> Regards,
>>
>> Tenny Susanto
>

--
Regards,

Tenny Susanto
