Hi Community,

I have a table (say A) that contains 1000 small files of 100 MB each. I want to create another table (say B) from the same data in A, but written as 256 MB files to match our HDFS block size.
I am doing `INSERT INTO B SELECT * FROM A`, but this generates 3000 small files of roughly 30 MB each, one reason being that the computation runs on 20 daemons. The documentation suggests setting `NUM_NODES=1` to stop the distributed tasks from being submitted to multiple nodes, but that leaves me with two further problems:

1. It is not optimal: a single node is overwhelmed with the computation and can often run out of scratch space under production loads.
2. It still does not give me 256 MB files.

In the above context, my questions are:

1. Can Impala perform a reduce-style operation on the data from multiple nodes so that it writes a single 256 MB file?
2. Is there any other way to generate files of 256 MB each?

My current workaround is to use Hive. That works perfectly, but I would like to use Impala's own capabilities for this.

Impala version: 2.12.0

Thanks in advance,
Ravi
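For reference, this is roughly what I am running (table names `A` and `B` are as described above; the file counts are from my cluster):

```sql
-- Attempt 1: plain copy. Each of the 20 daemons writes its own output
-- files, so I end up with ~3000 files of ~30 MB each instead of 256 MB files.
INSERT INTO B SELECT * FROM A;

-- Attempt 2: per the documentation, restrict execution to a single node.
SET NUM_NODES=1;
INSERT INTO B SELECT * FROM A;
-- Fewer files, but the single node is overloaded (scratch-space risk)
-- and the files still are not 256 MB.
```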