One thing I forgot to mention: when inserting into table B from table A, I am setting the PARQUET_FILE_SIZE option to 256 MB.
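For reference, the session looks roughly like this (a minimal sketch; the table names a and b stand in for the real ones):

    -- impala-shell: cap each Parquet output file at 256 MB
    SET PARQUET_FILE_SIZE=256m;
    -- rewrite table A's data into table B
    INSERT INTO b SELECT * FROM a;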
On Tue, 17 Mar 2020 at 17:40, Ravi Kanth <ravikanth....@gmail.com> wrote:
> Hi Community,
>
> I have a table (say A) that has 1000 small files of 100 MB each. I want to
> create another table (say B) using the same data from A, generating 256 MB
> files to match our HDFS block size.
>
> I am doing *INSERT INTO B SELECT * FROM A*, but this generates 3000 small
> files of roughly 30 MB each. One of the reasons is that the computation is
> distributed across 20 daemons.
>
> The documentation suggests *setting NUM_NODES=1* to prevent the work from
> being distributed across multiple nodes. I have two more problems here:
> 1. This is not optimal, as one node is overwhelmed with computation and
> can often run out of scratch space at production loads.
> 2. This still doesn't give me 256 MB files.
>
> In the above context, my questions are:
> 1. Can Impala perform a reduce operation on the data from multiple nodes
> to write a single 256 MB file?
> 2. Is there any other way I can generate these files of 256 MB each?
>
> The current workaround is using Hive. It works perfectly, but I want to
> make use of Impala's capabilities for this.
>
> Impala Version: 2.12.0
>
> Thanks in advance,
> Ravi
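For anyone following along, the single-node workaround mentioned in the quoted mail would look roughly like this (a sketch only; note it funnels the whole write through one daemon, which is exactly the scratch-space concern raised above):

    -- force the query to run on a single node so that
    -- only one writer produces output files
    SET NUM_NODES=1;
    SET PARQUET_FILE_SIZE=256m;
    INSERT INTO b SELECT * FROM a;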