Hello, I'm new to Apache Kudu. I was really impressed by the concept of
Kudu and its benchmark results, and I'm considering using Impala + Kudu for
my team's project.
One issue I have is that the Kudu table is much larger than the equivalent
Parquet files:
- Parquet files: 1.3TB
- Kudu table: 8.6TB
(both are configured with a replication factor of 3)
I'm using Kudu with CDH 5.10, and most of the configuration is unchanged
(I only changed `memory_limit_hard_bytes` and `block_cache_capacity_mb`
to prevent bulk load errors).
When I changed the `ENCODING` for some fields, the table size only
decreased by 5%. I assume there are other optimization techniques for
reducing Kudu table size. I would really appreciate any advice.
Thanks in advance.
`parquet_table` has 38 STRING fields and 6B rows.
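To put the sizes in perspective, here is a rough back-of-the-envelope calculation written as an Impala query. It assumes 1 TB = 10^12 bytes and that both quoted sizes already include all 3 replicas; neither assumption is confirmed, so the per-row figures are only indicative. The column aliases are made up for illustration.

```sql
-- Rough figures derived from the sizes and row count quoted in this post.
-- Assumptions: 1 TB = 10^12 bytes; both sizes include all 3 replicas.
SELECT
  8.6 / 1.3                               AS kudu_to_parquet_ratio,
  1.3 * POW(10, 12) / 3 / (6 * POW(10, 9)) AS parquet_bytes_per_row_per_replica,
  8.6 * POW(10, 12) / 3 / (6 * POW(10, 9)) AS kudu_bytes_per_row_per_replica;
```

Under those assumptions the Kudu table is roughly 6.6 times larger, which works out to about 478 bytes per row per replica versus about 72 for Parquet.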
The schema of `parquet_table` looks like this:
```
> SHOW CREATE TABLE parquet_table;
CREATE EXTERNAL TABLE default.parquet_table (
  a STRING,
  b STRING,
  c STRING,
  d STRING,
  ...
)
PARTITIONED BY (
  ymd STRING
)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS PARQUET
LOCATION 'hdfs://hostname/path/to/parquet'
```
I created `kudu_table` and bulk loaded it using
`INSERT INTO kudu_table SELECT * FROM parquet_table`:
```
> SHOW CREATE TABLE kudu_table;
CREATE TABLE default.kudu_table (
  a STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  b STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  c STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  d STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
  ...
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='host1,host2', 'kudu.table_name'='impala::kudu_table')
```
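For reference, the kind of per-column `ENCODING` and `COMPRESSION` setting mentioned earlier can be sketched as follows. This is only an illustration: the table name `kudu_table_encoded` is hypothetical, only the first four columns are shown, and the specific encoding and compression choices are assumptions, not the exact values I tried or values known to work well for this data.

```sql
-- Hypothetical variant of kudu_table with explicit per-column attributes.
-- Encoding/compression picks below are illustrative assumptions, not tuned values.
CREATE TABLE default.kudu_table_encoded (
  a STRING NOT NULL ENCODING PREFIX_ENCODING COMPRESSION LZ4,
  b STRING NOT NULL ENCODING DICT_ENCODING COMPRESSION LZ4,
  c STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,
  d STRING NULL ENCODING AUTO_ENCODING COMPRESSION ZLIB,
  -- remaining columns omitted here
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='host1,host2');
```

Note that the existing table shows `COMPRESSION DEFAULT_COMPRESSION` on every column, so an explicit compression codec per column may be worth testing separately from the encoding.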