Re: Generating a fixed size parquet file when doing Insert select *

2020-03-25 Thread Tim Armstrong
I believe that statement about the estimation is true. PARQUET_FILE_SIZE is
also an upper bound, and the result depends on the amount of data being
written to that partition on a particular Impala daemon - if less than a
file's worth of data is written on that node, you will get only a single
(possibly small) file.
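
For illustration, a minimal sketch of setting the option for the session
before the insert (the table names here are hypothetical; 256m is the usual
shorthand for 256MB):

[localhost:21000] default> set PARQUET_FILE_SIZE=256m;
[localhost:21000] default> insert into target_table select * from source_table;

The option caps file size but never pads files up to it, so the last (or
only) file a node writes for a partition can be much smaller than the cap.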

On Wed, Mar 25, 2020 at 4:06 AM Antoni Ivanov  wrote:

> Hi,
>
>
>
> The Impala team can correct me, but
>
>
>
> Even if you set PARQUET_FILE_SIZE to 256MB, Impala may (and likely will)
> create smaller files (e.g. 128MB or even smaller).
>
> As far as I understand, that's because while Impala is writing the
> parquet file, it estimates what the file size will be once written, and it
> cannot accurately account for compression and encoding effectiveness. For
> tables with many columns, for example, we have seen files as small as 32MB
> even with PARQUET_FILE_SIZE set to 256MB.
>
>
>
> If you want strict control over the exact file size, I do not think
> that's currently possible with Impala - though I may be wrong.
>
>
>
> The /*+shuffle*/ hint is indeed more scalable, as it allows all nodes to
> read the data instead of just one, and more than one node to write the
> data if you insert into more than one partition. At least that's my
> understanding.

RE: Generating a fixed size parquet file when doing Insert select *

2020-03-25 Thread Antoni Ivanov
Hi,

The Impala team can correct me, but

Even if you set PARQUET_FILE_SIZE to 256MB, Impala may (and likely will)
create smaller files (e.g. 128MB or even smaller).
As far as I understand, that's because while Impala is writing the parquet
file, it estimates what the file size will be once written, and it cannot
accurately account for compression and encoding effectiveness. For tables
with many columns, for example, we have seen files as small as 32MB even
with PARQUET_FILE_SIZE set to 256MB.

If you want strict control over the exact file size, I do not think that's
currently possible with Impala - though I may be wrong.

The /*+shuffle*/ hint is indeed more scalable, as it allows all nodes to
read the data instead of just one, and more than one node to write the data
if you insert into more than one partition. At least that's my
understanding.
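
For illustration, a sketch of combining the file size option with the hint
(the hint placement follows the explain examples further down the thread;
table and column names here are hypothetical):

[localhost:21000] default> set PARQUET_FILE_SIZE=256m;
[localhost:21000] default> insert into target_table partition (year, month) /*+shuffle*/
select col1, col2, year, month from source_table;

With the shuffle, each partition's rows land on a single writer, so you get
at most one undersized trailing file per partition instead of one per
partition per node.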


Re: Generating a fixed size parquet file when doing Insert select *

2020-03-17 Thread Tim Armstrong
Hi Ravi,
  There are a few details that could help understand the problem better. Is
the destination table partitioned? How many partitions does it have? If you
could share some query profiles, or at least the explain plans from the
different queries you're running, that would be helpful too.

This is a guess, but the /*+shuffle*/ hint on the insert might solve your
problem - it forces redistribution of the data based on the partition key,
so all of the data for each partition will land on the same node. Impala's
planner tries to be intelligent about whether to redistribute data when
inserting into partitioned tables, but sometimes its decision won't be
right for your needs. Below are example plans for the same query without
the hint (which skips the shuffle) and with it.

[localhost:21000] functional> explain insert into table alltypesinsert
partition (year, month)
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year,
month
from alltypessmall;
Query: explain insert into table alltypesinsert
partition (year, month)
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year,
month
from alltypessmall
+-----------------------------------------------------------------------------------------+
| Explain String                                                                           |
+-----------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=12.01MB Threads=2                              |
| Per-Host Resource Estimates: Memory=92MB                                                 |
| Codegen disabled by planner                                                              |
|                                                                                          |
| WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]  |
| |  partitions=4                                                                          |
| |                                                                                        |
| 01:SORT                                                                                  |
| |  order by: year ASC NULLS LAST, month ASC NULLS LAST                                   |
| |  row-size=89B cardinality=100                                                          |
| |                                                                                        |
| 00:SCAN HDFS [functional.alltypessmall]                                                  |
|    HDFS partitions=4/4 files=4 size=6.32KB                                               |
|    row-size=89B cardinality=100                                                          |
+-----------------------------------------------------------------------------------------+

[localhost:21000] functional> explain insert into table alltypesinsert
partition (year, month) /*+shuffle*/
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year,
month
from alltypessmall;
Query: explain insert into table alltypesinsert
partition (year, month) /*+shuffle*/
select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
float_col, double_col, date_string_col, string_col, timestamp_col, year,
month
from alltypessmall
+-----------------------------------------------------------------------------------------+
| Explain String                                                                           |
+-----------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=12.01MB Threads=3                              |
| Per-Host Resource Estimates: Memory=92MB                                                 |
| Codegen disabled by planner                                                              |
|                                                                                          |
| WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]  |
| |  partitions=4                                                                          |
| |                                                                                        |
| 02:SORT                                                                                  |
| |  order by: year ASC NULLS LAST, month ASC NULLS LAST                                   |
| |  row-size=89B cardinality=100                                                          |
| |                                                                                        |
| 01:EXCHANGE [HASH(`year`,`month`)]                                                       |
| |                                                                                        |
| 00:SCAN HDFS [functional.alltypessmall]                                                  |
|    HDFS partitions=4/4 files=4 size=6.32KB                                               |
|    row-size=89B cardinality=100                                                          |
+-----------------------------------------------------------------------------------------+
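
Note the difference between the two plans: with the hint, the
01:EXCHANGE [HASH(`year`,`month`)] step redistributes rows by the partition
key columns before the sort, so each partition is written by a single node
instead of by every node that scans data.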

Re: Generating a fixed size parquet file when doing Insert select *

2020-03-17 Thread Ravi Kanth
I forgot to mention:

When inserting into table B from table A, I am setting the
PARQUET_FILE_SIZE option to 256MB.



On Tue, 17 Mar 2020 at 17:40, Ravi Kanth  wrote:

> Hi Community,
>
> I have a table (say A) that has 1000 small files of 100 MB each. I want
> to create another table (say B) from the same data in A, generating 256 MB
> files to match our HDFS block size.
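> (That is roughly 100 GB of data in total, so on the order of 400 files of
> 256 MB each would be the ideal outcome.)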
>
> I am doing *insert into B select * from A*, but this generates 3000 small
> files of 30MB each. One of the reasons is that the computation is
> happening on 20 daemons.
>
> Upon reading the documentation, it's suggested to *set NUM_NODES=1* to
> keep distributed tasks from being submitted to multiple nodes. I have two
> more problems here:
> 1. This is not optimal, as one node is overwhelmed with the computation
> and can often run into scratch space limits at production loads.
> 2. This still doesn't give me 256 MB files.
>
> In the above context, my questions are:
> 1. Can Impala perform a reduce operation on the data from multiple nodes
> to write a single 256MB file?
> 2. Is there any other way to generate these 256MB files?
>
> The current workaround is using Hive. It works perfectly, but I want to
> use Impala's capabilities for this instead.
>
> Impala Version: 2.12.0
>
> Thanks in advance,
> Ravi
>