I forgot to mention:

When inserting into table B from table A, I am setting the
PARQUET_FILE_SIZE option to 256 MB.
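
Roughly, the statements look like this (a sketch; table names as in the
thread below):

  SET PARQUET_FILE_SIZE=256m;
  INSERT INTO B SELECT * FROM A;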



On Tue, 17 Mar 2020 at 17:40, Ravi Kanth <ravikanth....@gmail.com> wrote:

> Hi Community,
>
> I have a table (say A) that has 1000 small files of 100 MB each. I want
> to create another table (say B) from the same data in A that generates
> 256 MB files to match our HDFS block size.
>
> I am doing *insert into B select * from A*, but this generates 3000
> small files of 30 MB each. One of the reasons is that the computation
> happens on 20 daemons.
>
> Upon reading the documentation, it's suggested to *set NUM_NODES=1* to
> stop distributed tasks from being submitted to multiple nodes (see the
> snippet after this list). I have two more problems here:
> 1. This is not optimal, as one node is overwhelmed with computation and
> can often run out of scratch space at production loads.
> 2. This still doesn't give me 256 MB files.
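>
> The snippet I tried is roughly (same tables as above; exact statement
> from memory):
>
>   SET NUM_NODES=1;
>   INSERT INTO B SELECT * FROM A;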
>
> In the above context, my questions are:
> 1. Can Impala perform a reduce operation on the data from multiple nodes
> to write a single 256 MB file?
> 2. Is there any other way to generate these files of 256 MB each?
>
> The current workaround is to use Hive (see the sketch below). It works
> perfectly, but I want to make use of Impala's capabilities in doing so.
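>
> For reference, the Hive side is along these lines (merge settings from
> memory; exact property values may differ):
>
>   SET hive.merge.mapredfiles=true;
>   SET hive.merge.size.per.task=268435456;
>   INSERT OVERWRITE TABLE B SELECT * FROM A;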
>
> Impala Version: 2.12.0
>
> Thanks in advance,
> Ravi
>
