I believe that statement about the estimation is true. PARQUET_FILE_SIZE is also an upper bound, and the result depends on the amount of data being written to that partition on a particular Impala daemon - if less than a file's worth of data lands on that node, you will only get a single (possibly small) file.
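To make that concrete, a minimal sketch of what this thread suggests might look like the following (table and column names are placeholders, not from the original question; the size option remains an upper bound, so files can still come out smaller after compression and encoding):

    -- Illustrative sketch only: table_b/table_a and the column list are placeholders.
    -- PARQUET_FILE_SIZE is a per-file upper bound, not a guarantee.
    SET PARQUET_FILE_SIZE=268435456;   -- 256 MB target

    INSERT INTO table_b PARTITION (year, month) /*+shuffle*/
    SELECT id, payload, year, month    -- partition key columns must come last
    FROM table_a;

The /*+shuffle*/ hint here is the same one shown in the plans below: it adds an exchange so that all rows for a given (year, month) partition are written by a single node, which is what lets the writer fill files up toward the target size.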
On Wed, Mar 25, 2020 at 4:06 AM Antoni Ivanov <aiva...@vmware.com> wrote:

> Hi,
>
> The Impala team can correct me, but even if you set PARQUET_FILE_SIZE to 256MB, Impala may and likely will create smaller files (e.g. 128MB or even smaller).
>
> As far as I understand, that's because while Impala is writing the Parquet file it has to guess what the file size will be once written, and it cannot accurately account for compression and encoding effectiveness. So, for example, for tables with many columns we can see files as small as 32MB even with PARQUET_FILE_SIZE set to 256MB.
>
> If you want strict control over the exact file size, I do not think that is possible with Impala currently. I may be wrong, though.
>
> The /*+shuffle*/ hint is indeed more scalable, as it allows all nodes to read the data instead of one, and more than one node to write the data if you update more than one partition. At least that's my understanding.
>
> *From:* Tim Armstrong <tarmstr...@cloudera.com>
> *Sent:* Wednesday, March 18, 2020 8:35 AM
> *To:* user@impala.apache.org
> *Subject:* Re: Generating a fixed size parquet file when doing Insert select *
>
> Hi Ravi,
>
> There are a few details that would help us understand the problem better. Is the destination table partitioned? How many partitions does it have? If you could share some query profiles, or at least the explain plans from the different queries you're running, that would be helpful too.
>
> This is a guess, but the /*+shuffle*/ hint for the insert might solve your problem - it forces redistribution of the data based on the partition key, so all of the data for each partition will land on the same node. Impala's planner tries to be intelligent about whether to redistribute data when inserting into partitioned tables, but sometimes the decision won't be right for your needs. Here are example plans: the first query skips the shuffle without the hint, while the second picks up the exchange once the hint is added.
> [localhost:21000] functional> explain insert into table alltypesinsert partition (year, month)
>                               select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
>                                      float_col, double_col, date_string_col, string_col, timestamp_col,
>                                      year, month
>                               from alltypessmall;
>
> Max Per-Host Resource Reservation: Memory=12.01MB Threads=2
> Per-Host Resource Estimates: Memory=92MB
> Codegen disabled by planner
>
> WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]
> |  partitions=4
> |
> 01:SORT
> |  order by: year ASC NULLS LAST, month ASC NULLS LAST
> |  row-size=89B cardinality=100
> |
> 00:SCAN HDFS [functional.alltypessmall]
>    HDFS partitions=4/4 files=4 size=6.32KB
>    row-size=89B cardinality=100
>
> [localhost:21000] functional> explain insert into table alltypesinsert partition (year, month) /*+shuffle*/
>                               select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
>                                      float_col, double_col, date_string_col, string_col, timestamp_col,
>                                      year, month
>                               from alltypessmall;
>
> Max Per-Host Resource Reservation: Memory=12.01MB Threads=3
> Per-Host Resource Estimates: Memory=92MB
> Codegen disabled by planner
>
> WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]
> |  partitions=4
> |
> 02:SORT
> |  order by: year ASC NULLS LAST, month ASC NULLS LAST
> |  row-size=89B cardinality=100
> |
> 01:EXCHANGE [HASH(`year`,`month`)]
> |
> 00:SCAN HDFS [functional.alltypessmall]
>    HDFS partitions=4/4 files=4 size=6.32KB
>    row-size=89B cardinality=100
>
> On Tue, Mar 17, 2020 at 5:43 PM Ravi Kanth <ravikanth....@gmail.com> wrote:
>
> I forgot to mention:
>
> When inserting to table B from table A, I am setting the PARQUET_FILE_SIZE option to 256MB.
>
> On Tue, 17 Mar 2020 at 17:40, Ravi Kanth <ravikanth....@gmail.com> wrote:
>
> Hi Community,
>
> I have a table (say A) that has 1000 small files of 100 MB each. I want to create another table (say B) from the same data in A, generating 256 MB files to match our HDFS block size.
>
> I am doing an *insert into B select * from A*, but this generates 3000 small files of 30MB each. One of the reasons is that the computation is happening on 20 daemons.
> Upon reading the documentation, it's suggested to set NUM_NODES=1 to stop distributed tasks from being submitted to multiple nodes. I have two more problems here:
>
> 1. This is not optimal, as one node is overwhelmed with the computation and can often run into scratch space limits at production loads.
> 2. This still doesn't give me 256 MB files.
>
> In the above context, my questions are:
>
> 1. Can Impala perform a reduce operation on the data from multiple nodes to write a single 256MB file?
> 2. Is there any other way I can generate these files of 256MB each?
>
> The current workaround is using Hive. It works perfectly, but I want to make use of Impala's capabilities in doing so.
>
> Impala Version: 2.12.0
>
> Thanks in advance,
> Ravi
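For completeness, the single-node workaround described above amounts to something like the sketch below (again, table and column names are placeholders). As noted in the thread, it funnels all the work through one node, so it scales poorly, can hit scratch space limits, and still does not guarantee files of the target size:

    -- Illustrative sketch of the NUM_NODES=1 workaround; placeholder names as before.
    SET NUM_NODES=1;                   -- run the whole statement on a single node
    SET PARQUET_FILE_SIZE=268435456;   -- 256 MB target (still only an upper bound)

    INSERT INTO table_b PARTITION (year, month)
    SELECT id, payload, year, month
    FROM table_a;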