Hi Todd,

The schema I posted was generated by `CREATE TABLE kudu_table AS SELECT * FROM parquet_table`.
Last weekend, I tested different combinations of encodings and compressions. The size has now dropped by about 70%, but it is still roughly 200% of the Parquet size. I'm still testing which encoding is best for specific columns. I hope it gets closer to Parquet ;)

Thanks,
Jason

2017-03-13 15:30 GMT+09:00 Todd Lipcon <[email protected]>:

> Hi Jason,
>
> The first thing that jumps out to me is that you aren't using dictionary
> encoding on your string columns. I would recommend using DICT_ENCODING for
> all string fields and BIT_SHUFFLE for all int/double/float fields. If you
> have any string fields which are not repetitive (i.e., high cardinality), I
> would also recommend enabling LZ4 compression on them (Parquet uses lz4 by
> default on all strings).
>
> That should get you close to Parquet sizes (and those are the new defaults
> in the upcoming 1.3 release). If you still see a 6x blowup after making
> these changes, please report back.
>
> -Todd
>
> On Fri, Mar 10, 2017 at 7:16 PM, Jason Heo <[email protected]>
> wrote:
>
>> Hello, I'm new to Apache Kudu. I was really impressed by the concept of
>> Kudu and the benchmark results. I'm considering using Impala + Kudu on my
>> team's project.
>>
>> One of the issues I have is that the Kudu table is much bigger than the
>> Parquet file:
>>
>> - Parquet file: 1.3TB
>> - Kudu table: 8.6TB
>>
>> (both tables are configured with a replication factor of 3)
>>
>> I'm using Kudu with CDH 5.10, and most of the configuration is unchanged
>> (I've only changed `memory_limit_hard_bytes` and `block_cache_capacity_mb`
>> to prevent bulk-load errors).
>>
>> When I changed the `ENCODING` of some fields, the size only decreased by
>> 5%. I'm guessing there are some optimization techniques to reduce the Kudu
>> table size.
>>
>> I would really appreciate any advice.
>>
>> Thanks in advance.
>>
>> `parquet_table` has 38 STRING fields and 6B rows.
>>
>> The schema of `parquet_table` looks like this:
>>
>> ```
>> > SHOW CREATE TABLE parquet_table;
>> CREATE EXTERNAL TABLE default.parquet_table (
>>   a STRING,
>>   b STRING,
>>   c STRING,
>>   d STRING,
>>   ...
>> )
>> PARTITIONED BY (
>>   ymd STRING
>> )
>> WITH SERDEPROPERTIES ('serialization.format'='1')
>> STORED AS PARQUET
>> LOCATION 'hdfs://hostname/path/to/parquet'
>> ```
>>
>> I created `kudu_table` and bulk-loaded it using `INSERT INTO kudu SELECT
>> * FROM parquet_table`:
>>
>> ```
>> > SHOW CREATE TABLE kudu_table;
>> CREATE TABLE default.kudu_table (
>>   a STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   b STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   c STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   d STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   ...
>>   PRIMARY KEY (a, b)
>> )
>> PARTITION BY HASH (a) PARTITIONS 40
>> STORED AS KUDU
>> TBLPROPERTIES ('kudu.master_addresses'='host1,host2', 'kudu.table_name'='impala::kudu_table')
>> ```
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
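Applying Todd's recommendations to the schema above might look like the following sketch, using the same per-column `ENCODING`/`COMPRESSION` syntax that appears in the `SHOW CREATE TABLE kudu_table` output. The per-column choices here are assumptions for illustration only; which string columns are repetitive (low cardinality) versus high cardinality depends on the actual data:

```sql
-- Sketch: explicit encodings per Todd's advice (column names mirror the
-- schema above; which columns get which treatment is an assumption).
CREATE TABLE kudu_table_tuned (
  -- Repetitive (low-cardinality) strings: dictionary encoding.
  a STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  b STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  -- Hypothetically high-cardinality string: plain encoding plus LZ4.
  c STRING NULL ENCODING PLAIN_ENCODING COMPRESSION LZ4,
  d STRING NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU;
```

The table could then be repopulated with `INSERT INTO kudu_table_tuned SELECT * FROM parquet_table` and its on-disk size compared against the 8.6TB baseline. (For int/double/float columns, Todd's suggested BIT_SHUFFLE encoding would be declared the same way; this schema has only STRING fields.)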
