Hi Community,

I have a table (say A) that contains 1000 small files of 100 MB each. I want to create another table (say B) from the same data in A, but written as 256 MB files to match our HDFS block size.
I am doing `INSERT INTO B SELECT * FROM A`, but this generates 3000 small files of roughly 30 MB each, one reason being that the computation runs on 20 daemons. The documentation suggests setting `NUM_NODES=1` to stop the distributed tasks from being submitted to multiple nodes, but that leaves me with two further problems:

1. It is not optimal: a single node is overwhelmed with the computation and can often run out of scratch space under production loads.
2. It still does not give me 256 MB files.

In the above context, my questions are:

1. Can Impala perform a reduce-style operation on the data from multiple nodes so that it writes a single 256 MB file?
2. Is there any other way to generate files of 256 MB each?

My current workaround is to use Hive. That works perfectly, but I would like to use Impala's own capabilities for this.

Impala version: 2.12.0

Thanks in advance,
Ravi
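For reference, this is roughly what I am running (table names `A` and `B` are as described above; the file counts are from my cluster):

```sql
-- Attempt 1: plain copy. Each of the 20 daemons writes its own output
-- files, so I end up with ~3000 files of ~30 MB each instead of 256 MB files.
INSERT INTO B SELECT * FROM A;

-- Attempt 2: per the documentation, restrict execution to a single node.
SET NUM_NODES=1;
INSERT INTO B SELECT * FROM A;
-- Fewer files, but the single node is overloaded (scratch-space risk)
-- and the files still are not 256 MB.
```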