Hi Ravi,

There are a few details that would help us understand the problem better. Is the destination table partitioned? How many partitions does it have? If you could share some query profiles, or at least the explain plans from the different queries you're running, that would be helpful too.
This is a guess, but the /*+shuffle*/ hint on the insert might solve your problem: it forces redistribution of the data based on the partition key, so all of the data for each partition lands on the same node. Impala's planner tries to be intelligent about whether to redistribute data when inserting into partitioned tables, but sometimes its decision won't be right for your needs. Here are example plans for the same insert, first without the hint (the planner skips the shuffle) and then with it (note the added 01:EXCHANGE step):

[localhost:21000] functional> explain insert into table alltypesinsert partition (year, month)
                              select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
                              float_col, double_col, date_string_col, string_col, timestamp_col,
                              year, month from alltypessmall;
+-----------------------------------------------------------------------------------------+
| Explain String                                                                          |
+-----------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=12.01MB Threads=2                             |
| Per-Host Resource Estimates: Memory=92MB                                                |
| Codegen disabled by planner                                                             |
|                                                                                         |
| WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)] |
| |  partitions=4                                                                         |
| |                                                                                       |
| 01:SORT                                                                                 |
| |  order by: year ASC NULLS LAST, month ASC NULLS LAST                                  |
| |  row-size=89B cardinality=100                                                         |
| |                                                                                       |
| 00:SCAN HDFS [functional.alltypessmall]                                                 |
|    HDFS partitions=4/4 files=4 size=6.32KB                                              |
|    row-size=89B cardinality=100                                                         |
+-----------------------------------------------------------------------------------------+

[localhost:21000] functional> explain insert into table alltypesinsert partition (year, month)
                              /*+shuffle*/ select id, bool_col, tinyint_col, smallint_col, int_col,
                              bigint_col, float_col, double_col, date_string_col, string_col,
                              timestamp_col, year, month from alltypessmall;
+-----------------------------------------------------------------------------------------+
| Explain String                                                                          |
+-----------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=12.01MB Threads=3                             |
| Per-Host Resource Estimates: Memory=92MB                                                |
| Codegen disabled by planner                                                             |
|                                                                                         |
| WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)] |
| |  partitions=4                                                                         |
| |                                                                                       |
| 02:SORT                                                                                 |
| |  order by: year ASC NULLS LAST, month ASC NULLS LAST                                  |
| |  row-size=89B cardinality=100                                                         |
| |                                                                                       |
| 01:EXCHANGE [HASH(`year`,`month`)]                                                      |
| |                                                                                       |
| 00:SCAN HDFS [functional.alltypessmall]                                                 |
|    HDFS partitions=4/4 files=4 size=6.32KB                                              |
|    row-size=89B cardinality=100                                                         |
+-----------------------------------------------------------------------------------------+

On Tue, Mar 17, 2020 at 5:43 PM Ravi Kanth <ravikanth....@gmail.com> wrote:

> I forgot to mention:
>
> When inserting into table B from table A, I am setting the PARQUET_FILE_SIZE
> option to 256MB.
>
> On Tue, 17 Mar 2020 at 17:40, Ravi Kanth <ravikanth....@gmail.com> wrote:
>
>> Hi Community,
>>
>> I have a table (say A) that has 1000 small files of 100 MB each. I want
>> to create another table (say B) using the same data from A, generating
>> 256 MB files to match our HDFS block size.
>>
>> I am doing *insert into B select * from A*, but this generates 3000
>> small files of 30 MB each. One of the reasons is that the computation
>> is happening on 20 daemons.
>>
>> The documentation suggests *setting num_nodes=1* to stop distributed
>> tasks from being submitted to multiple nodes. I have two more problems
>> here:
>> 1. This is not optimal, as one node is overwhelmed with computation and
>> can often run out of scratch space at production loads.
>> 2. This still doesn't give me 256 MB files.
>>
>> In the above context, my questions are:
>> 1. Can Impala perform a reduce operation on the data from multiple
>> nodes to write a single 256 MB file?
>> 2. Is there any other way to generate files of 256 MB each?
>>
>> The current workaround is using Hive. It works perfectly, but I want to
>> make use of Impala's capabilities in doing so.
>>
>> Impala Version: 2.12.0
>>
>> Thanks in advance,
>> Ravi
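P.S. To make the suggestion concrete, here is a minimal sketch of combining the shuffle hint with the PARQUET_FILE_SIZE option you are already setting. The names table_a, table_b, part_col, and the column list are hypothetical placeholders standing in for your actual schema:

```sql
-- Set in impala-shell before the insert; caps the target size of each
-- Parquet file (actual sizes also depend on compression and row counts).
SET PARQUET_FILE_SIZE=256m;

-- The /*+shuffle*/ hint adds a hash exchange on the partition key, so each
-- partition's rows are funneled to a single writer node instead of every
-- daemon writing its own small file for every partition.
INSERT INTO table_b PARTITION (part_col) /*+shuffle*/
SELECT col1, col2, part_col
FROM table_a;
```

If the hint takes effect, EXPLAIN on the hinted insert should show an EXCHANGE [HASH(...)] node like the second plan above.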