I wanted to find the optimal Parquet file size. It looks like no matter
how large I set the block size, Hive always produces Parquet files of the
same size.
For the experiment, I was copying everything from one table into an
identical dummy table. There are a lot of small files. Here are the details:
I created a Hive table and used INSERT ... SELECT to load the existing
Impala data into it. I noticed two things:
1. The new data is more than twice the size of the old data. The old data
was compressed by Impala.
2. No matter how large I set the Parquet block size, Hive always generates
Parquet files of the same size.
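To make the setup concrete, here is a sketch of the kind of session I mean. The table names are placeholders, and the 256 MB value is only an example of the sort of setting I was varying; `parquet.block.size` is the Hive/Parquet property that controls the Parquet row-group size in bytes:

```sql
-- Example values only; dummy_table / source_table are placeholder names.
SET parquet.block.size=268435456;   -- Parquet row-group size, e.g. 256 MB
SET dfs.blocksize=268435456;        -- HDFS block size, kept in sync

-- Copy everything from the source table into the dummy table.
INSERT OVERWRITE TABLE dummy_table
SELECT * FROM source_table;
```

Whatever value I put in `parquet.block.size`, the output files came out the same size.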