Hi,

The Impala team can correct me here, but:

Even if you set PARQUET_FILE_SIZE to 256MB, Impala may, and likely will, create
smaller files (e.g. 128MB or even smaller). As far as I understand, that's
because while Impala is writing the Parquet file it has to guess the eventual
on-disk size, and it cannot accurately account for how effective compression
and encoding will be. For tables with many columns, for example, we have seen
files as small as 32MB even with the Parquet file size set to 256MB.

If you want strict control over the exact file size, I don't think that's
currently possible with Impala, though I may be wrong.
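
In case it helps, the option itself is just a per-query setting. A minimal
sketch in impala-shell (the table names here are hypothetical):

-- Ask Impala to target ~256MB Parquet files. As noted above, this is
-- effectively an upper bound, not a guaranteed file size.
SET PARQUET_FILE_SIZE=256m;
INSERT INTO big_files_table SELECT * FROM small_files_table;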

The /*+shuffle*/ hint is indeed more scalable, as it allows all nodes to read
the data instead of just one, and more than one node to write the data if you
update more than one partition. At least that's my understanding.
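
For reference, the hint goes between the PARTITION clause and the SELECT, as
in this sketch (table and column names are hypothetical; Tim's plans below
show the real syntax in action):

INSERT INTO target_table PARTITION (year, month) /*+shuffle*/
SELECT col1, col2, year, month FROM source_table;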


From: Tim Armstrong <tarmstr...@cloudera.com>
Sent: Wednesday, March 18, 2020 8:35 AM
To: user@impala.apache.org
Subject: Re: Generating a fixed size parquet file when doing Insert select *

Hi Ravi,
  There are a few details that could help us understand the problem better. Is
the destination table partitioned? How many partitions does it have? If you
could share some query profiles, or at least the explain plans from the
different queries you're running, that would be helpful too.

This is a guess, but the /*+shuffle*/ hint on the insert might solve your
problem: it forces redistribution of the data based on the partition key, so
all of the data for each partition will land on the same node. Impala's planner
tries to be intelligent about whether to redistribute data when inserting into
partitioned tables, but sometimes its decision won't be right for your needs.
Here are example plans for a query that skips the shuffle without the hint,
followed by the same query with it.

[localhost:21000] functional> explain insert into table alltypesinsert
partition (year, month)
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall;
Query: explain insert into table alltypesinsert
partition (year, month)
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall
+-----------------------------------------------------------------------------------------+
| Explain String                                                                           |
+-----------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=12.01MB Threads=2                              |
| Per-Host Resource Estimates: Memory=92MB                                                 |
| Codegen disabled by planner                                                              |
|                                                                                          |
| WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]  |
| |  partitions=4                                                                          |
| |                                                                                        |
| 01:SORT                                                                                  |
| |  order by: year ASC NULLS LAST, month ASC NULLS LAST                                   |
| |  row-size=89B cardinality=100                                                          |
| |                                                                                        |
| 00:SCAN HDFS [functional.alltypessmall]                                                  |
|    HDFS partitions=4/4 files=4 size=6.32KB                                               |
|    row-size=89B cardinality=100                                                          |
+-----------------------------------------------------------------------------------------+

[localhost:21000] functional> explain insert into table alltypesinsert
partition (year, month) /*+shuffle*/
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall;
Query: explain insert into table alltypesinsert
partition (year, month) /*+shuffle*/
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year, month
from alltypessmall
+-----------------------------------------------------------------------------------------+
| Explain String                                                                           |
+-----------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=12.01MB Threads=3                              |
| Per-Host Resource Estimates: Memory=92MB                                                 |
| Codegen disabled by planner                                                              |
|                                                                                          |
| WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]  |
| |  partitions=4                                                                          |
| |                                                                                        |
| 02:SORT                                                                                  |
| |  order by: year ASC NULLS LAST, month ASC NULLS LAST                                   |
| |  row-size=89B cardinality=100                                                          |
| |                                                                                        |
| 01:EXCHANGE [HASH(`year`,`month`)]                                                       |
| |                                                                                        |
| 00:SCAN HDFS [functional.alltypessmall]                                                  |
|    HDFS partitions=4/4 files=4 size=6.32KB                                               |
|    row-size=89B cardinality=100                                                          |
+-----------------------------------------------------------------------------------------+

On Tue, Mar 17, 2020 at 5:43 PM Ravi Kanth <ravikanth....@gmail.com> wrote:
I forgot to mention:

When inserting into table B from table A, I am setting the PARQUET_FILE_SIZE
option to 256MB.



On Tue, 17 Mar 2020 at 17:40, Ravi Kanth <ravikanth....@gmail.com> wrote:
Hi Community,

I have a table (say A) that has 1000 small files of 100 MB each. I want to
create another table (say B) from the same data in A, generating 256 MB files
to match our HDFS block size.

I am doing an INSERT INTO B SELECT * FROM A, but this generates 3000 small
files of 30MB each, one reason being that the computation happens across 20
daemons.

Upon reading the documentation, the suggestion is to set NUM_NODES=1 to stop
distributed tasks being submitted to multiple nodes (see the sketch after this
list). I have two more problems here:
1. This is not optimal, as one node is overwhelmed with the computation and can
often run into scratch space limits at production loads.
2. This still doesn't give me 256 MB files.
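
For reference, what I tried based on the docs looks roughly like this
(NUM_NODES is the documented query option; the table names are just from my
example above):

-- Run the whole statement on a single node, per the documentation's suggestion.
SET NUM_NODES=1;
SET PARQUET_FILE_SIZE=256m;
INSERT INTO B SELECT * FROM A;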

In the above context, my questions are:
1. Can Impala perform a reduce operation on the data from multiple nodes to
write a single 256MB file?
2. Is there any other way I can use to generate these files of 256MB each?

The current workaround is using Hive, which works perfectly, but I want to make
use of Impala's capabilities for this.

Impala Version: 2.12.0

Thanks in advance,
Ravi
