Hi Todd,

The schema I posted was generated by `CREATE TABLE kudu_table AS SELECT * FROM parquet_table`.
Last weekend, I tested different combinations of encodings and compressions. The size has now dropped by about 70%, but it is still roughly 200% of the Parquet size. I'm still testing which encoding is best for specific columns. I hope it gets closer to Parquet ;)

Thanks,
Jason

2017-03-13 15:30 GMT+09:00 Todd Lipcon <[email protected]>:

> Hi Jason,
>
> The first thing that jumps out to me is that you aren't using dictionary
> encoding on your string columns. I would recommend using DICT_ENCODING for
> all string fields and BIT_SHUFFLE for all int/double/float fields. If you
> have any string fields which are not repetitive (i.e., high cardinality), I
> would also recommend enabling LZ4 compression on them (Parquet uses lz4 by
> default on all strings).
>
> That should get you close to Parquet sizes (and those are the new defaults
> in the upcoming 1.3 release). If you still see a 6x blowup after making
> these changes, please report back.
>
> -Todd
>
> On Fri, Mar 10, 2017 at 7:16 PM, Jason Heo <[email protected]>
> wrote:
>
>> Hello, I'm new to Apache Kudu. I was really impressed by the concept of
>> Kudu and the benchmark results. I'm considering using Impala + Kudu on my
>> team's project.
>>
>> One of the issues I have is that the Kudu table is much bigger than the
>> Parquet file:
>>
>> - Parquet file: 1.3TB
>> - Kudu table: 8.6TB
>>
>> (both tables are configured with a replication factor of 3)
>>
>> I'm using Kudu with CDH 5.10, and most of the configuration is unchanged
>> (I've only changed `memory_limit_hard_bytes` and `block_cache_capacity_mb`
>> to prevent bulk-load errors).
>>
>> When I changed the `ENCODING` of some fields, the size only decreased by
>> 5%. I'm guessing there are some optimization techniques to reduce the Kudu
>> table size.
>>
>> I would really appreciate any advice.
>>
>> Thanks in advance.
>>
>> `parquet_table` has 38 STRING fields and 6B rows.
>>
>> The schema of `parquet_table` looks like this:
>>
>> ```
>> > SHOW CREATE TABLE parquet_table;
>> CREATE EXTERNAL TABLE default.parquet_table (
>>   a STRING,
>>   b STRING,
>>   c STRING,
>>   d STRING,
>>   ...
>> )
>> PARTITIONED BY (
>>   ymd STRING
>> )
>> WITH SERDEPROPERTIES ('serialization.format'='1')
>> STORED AS PARQUET
>> LOCATION 'hdfs://hostname/path/to/parquet'
>> ```
>>
>> I created `kudu_table` and bulk-loaded it using `INSERT INTO kudu SELECT
>> * FROM parquet_table`:
>>
>> ```
>> > SHOW CREATE TABLE kudu_table;
>> CREATE TABLE default.kudu_table (
>>   a STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   b STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   c STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   d STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>>   ...
>>   PRIMARY KEY (a, b)
>> )
>> PARTITION BY HASH (a) PARTITIONS 40
>> STORED AS KUDU
>> TBLPROPERTIES ('kudu.master_addresses'='host1,host2', 'kudu.table_name'='impala::kudu_table')
>> ```
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
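Applying Todd's recommendations to the schema above might look like the following sketch, using the same per-column `ENCODING`/`COMPRESSION` syntax that appears in the `SHOW CREATE TABLE kudu_table` output. The per-column choices here are assumptions for illustration only; which string columns are repetitive (low cardinality) versus high cardinality depends on the actual data:

```sql
-- Sketch: explicit encodings per Todd's advice (column names mirror the
-- schema above; which columns get which treatment is an assumption).
CREATE TABLE kudu_table_tuned (
  -- Repetitive (low-cardinality) strings: dictionary encoding.
  a STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  b STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  -- Hypothetically high-cardinality string: plain encoding plus LZ4.
  c STRING NULL ENCODING PLAIN_ENCODING COMPRESSION LZ4,
  d STRING NULL ENCODING DICT_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU;
```

The table could then be repopulated with `INSERT INTO kudu_table_tuned SELECT * FROM parquet_table` and its on-disk size compared against the 8.6TB baseline. (For int/double/float columns, Todd's suggested BIT_SHUFFLE encoding would be declared the same way; this schema has only STRING fields.)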
