I have 24 tablet servers. I added an id column because Kudu requires a primary key to be specified, and I needed a unique column to serve as one. My original table actually has 20 columns and no single column is unique on its own, so I concatenated 5 of them to build a unique id column and made it part of the primary key. I tried specifying all 5 columns as the primary key, but the inserts were much slower, so I switched to just 2 columns as the primary key, which seems to improve insert speed.
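In case it helps, the load step that builds the concatenated id looks roughly like this. It is just a sketch: col_a through col_e stand in for the five source columns I concatenate, and src_table for the staging table I load from.

INSERT INTO kudu_fact_table (print_date_id, id, company_id, transcount)
SELECT
  print_date_id,
  -- build the surrogate id by concatenating the five source columns
  concat_ws('|',
            cast(col_a AS STRING),
            cast(col_b AS STRING),
            cast(col_c AS STRING),
            cast(col_d AS STRING),
            cast(col_e AS STRING)) AS id,
  company_id,
  transcount
FROM src_table;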
So this is my new schema, and I will measure query speed with it. If I partition by day, is 1000 partitions too many? What is the recommended maximum number of partitions Kudu can handle?

CREATE TABLE kudu_fact_table (
  print_date_id INT,
  id STRING,
  company_id INT,
  transcount INT,
  PRIMARY KEY(print_date_id, id)
)
PARTITION BY RANGE (print_date_id) (
  PARTITION VALUE = 20170101,
  PARTITION VALUE = 20170102
  ... (1 partition for each day, and I have 3 years' worth of data)
)
STORED AS KUDU
TBLPROPERTIES(
  'kudu.table_name' = 'kudu_fact_table',
  'kudu.master_addresses' = 'myserver:7051'
);

On Thu, Feb 23, 2017 at 6:29 PM, Todd Lipcon <[email protected]> wrote:

> I'd add that moving the print_date_id to the beginning of the primary key
> in the Kudu fact table would allow each server to do a range scan instead
> of a full scan.
>
> -Todd
>
> On Thu, Feb 23, 2017 at 5:40 PM, Dan Burkert <[email protected]> wrote:
>
>> Hi Tenny,
>>
>> First off, how many tablet servers are in your cluster? 16 partitions is
>> appropriate for one or maybe two tablet servers, so if your cluster is
>> bigger you could try bumping the number of partitions.
>>
>> Second, the schemas don't look identical: you have an additional 'id'
>> column in the Kudu table, and crucially, it doesn't have any predicates,
>> so this query is doing a full table scan.
>>
>> Finally, the Parquet table is likely able to take advantage of
>> significant partition pruning due to the BETWEEN clause. An equivalent in
>> Kudu would be range partitioning on print_date_id. You might try doing
>> the same for Kudu.
>>
>> - Dan
>>
>> On Thu, Feb 23, 2017 at 5:08 PM, tenny susanto <[email protected]> wrote:
>>
>>> I have a table (call this fact_table) that I want to create in Kudu.
>>>
>>> I have an equivalent table in Impala/Parquet that is partitioned by
>>> print_date_id:
>>>
>>> create table impala_fact_table (
>>>   company_id INT,
>>>   transcount INT)
>>> partitioned by (print_date_id INT)
>>> STORED AS PARQUET;
>>>
>>> so a common query would be:
>>>
>>> select sum(transcount)
>>> from impala_fact_table f
>>> join company_dim c on f.company_id = c.company_id
>>> where c.company_id in (123, 456)
>>> and f.print_date_id between 20170101 and 20170202;
>>>
>>> I created an equivalent of the fact table in Kudu:
>>>
>>> CREATE TABLE kudu_fact_table (
>>>   id STRING,
>>>   print_date_id INT,
>>>   company_id INT,
>>>   transcount INT,
>>>   PRIMARY KEY(id, print_date_id)
>>> )
>>> PARTITION BY HASH PARTITIONS 16
>>> STORED AS KUDU
>>> TBLPROPERTIES(
>>>   'kudu.table_name' = 'kudu_fact_table',
>>>   'kudu.master_addresses' = 'myserver:7051'
>>> );
>>>
>>> But the performance of the join with this Kudu table is terrible: 2 secs
>>> with the Impala table vs 126 secs with the Kudu table.
>>>
>>> select sum(transcount)
>>> from kudu_fact_table f
>>> join company_dim c on f.company_id = c.company_id
>>> where c.company_id in (123, 456)
>>> and f.print_date_id between 20170101 and 20170202;
>>>
>>> How should I design my Kudu table so performance is somewhat comparable?
>>>
>>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Regards,
Tenny Susanto
