I believe that statement about the estimation is true. PARQUET_FILE_SIZE is also an upper bound, and the result depends on the amount of data being written to that partition on a particular Impala daemon - if less than a file's worth of data lands on that node, you will only get a single (possibly small) file.
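To make that concrete, a minimal sketch of what this thread suggests might look like the following (table and column names are placeholders, not from the original question; the size option remains an upper bound, so files can still come out smaller after compression and encoding):

    -- Illustrative sketch only: table_b/table_a and the column list are placeholders.
    -- PARQUET_FILE_SIZE is a per-file upper bound, not a guarantee.
    SET PARQUET_FILE_SIZE=268435456;   -- 256 MB target

    INSERT INTO table_b PARTITION (year, month) /*+shuffle*/
    SELECT id, payload, year, month    -- partition key columns must come last
    FROM table_a;

The /*+shuffle*/ hint here is the same one shown in the plans below: it adds an exchange so that all rows for a given (year, month) partition are written by a single node, which is what lets the writer fill files up toward the target size.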
On Wed, Mar 25, 2020 at 4:06 AM Antoni Ivanov <aiva...@vmware.com> wrote:

> Hi,
>
> The Impala team can correct me, but even if you set PARQUET_FILE_SIZE to 256MB, Impala may and likely will create smaller files (e.g. 128MB or even smaller).
>
> As far as I understand, that's because while Impala is writing the Parquet file it has to guess what the file size will be once written, and it cannot accurately account for compression and encoding effectiveness. So, for example, for tables with many columns we can see files as small as 32MB even with PARQUET_FILE_SIZE set to 256MB.
>
> If you want strict control over the exact file size, I do not think that is possible with Impala currently. I may be wrong, though.
>
> The /*+shuffle*/ hint is indeed more scalable, as it allows all nodes to read the data instead of one, and more than one node to write the data if you update more than one partition. At least that's my understanding.
>
> *From:* Tim Armstrong <tarmstr...@cloudera.com>
> *Sent:* Wednesday, March 18, 2020 8:35 AM
> *To:* user@impala.apache.org
> *Subject:* Re: Generating a fixed size parquet file when doing Insert select *
>
> Hi Ravi,
>
> There are a few details that would help us understand the problem better. Is the destination table partitioned? How many partitions does it have? If you could share some query profiles, or at least the explain plans from the different queries you're running, that would be helpful too.
>
> This is a guess, but the /*+shuffle*/ hint for the insert might solve your problem - it forces redistribution of the data based on the partition key, so all of the data for each partition will land on the same node. Impala's planner tries to be intelligent about whether to redistribute data when inserting into partitioned tables, but sometimes the decision won't be right for your needs. Here are example plans: the first query skips the shuffle without the hint, while the second picks up the exchange once the hint is added.
> [localhost:21000] functional> explain insert into table alltypesinsert partition (year, month)
>                               select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
>                                      float_col, double_col, date_string_col, string_col, timestamp_col,
>                                      year, month
>                               from alltypessmall;
>
> Max Per-Host Resource Reservation: Memory=12.01MB Threads=2
> Per-Host Resource Estimates: Memory=92MB
> Codegen disabled by planner
>
> WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]
> |  partitions=4
> |
> 01:SORT
> |  order by: year ASC NULLS LAST, month ASC NULLS LAST
> |  row-size=89B cardinality=100
> |
> 00:SCAN HDFS [functional.alltypessmall]
>    HDFS partitions=4/4 files=4 size=6.32KB
>    row-size=89B cardinality=100
>
> [localhost:21000] functional> explain insert into table alltypesinsert partition (year, month) /*+shuffle*/
>                               select id, bool_col, tinyint_col, smallint_col, int_col, bigint_col,
>                                      float_col, double_col, date_string_col, string_col, timestamp_col,
>                                      year, month
>                               from alltypessmall;
>
> Max Per-Host Resource Reservation: Memory=12.01MB Threads=3
> Per-Host Resource Estimates: Memory=92MB
> Codegen disabled by planner
>
> WRITE TO HDFS [functional.alltypesinsert, OVERWRITE=false, PARTITION-KEYS=(year,month)]
> |  partitions=4
> |
> 02:SORT
> |  order by: year ASC NULLS LAST, month ASC NULLS LAST
> |  row-size=89B cardinality=100
> |
> 01:EXCHANGE [HASH(`year`,`month`)]
> |
> 00:SCAN HDFS [functional.alltypessmall]
>    HDFS partitions=4/4 files=4 size=6.32KB
>    row-size=89B cardinality=100
>
> On Tue, Mar 17, 2020 at 5:43 PM Ravi Kanth <ravikanth....@gmail.com> wrote:
>
> I forgot to mention:
>
> When inserting to table B from table A, I am setting the PARQUET_FILE_SIZE option to 256MB.
>
> On Tue, 17 Mar 2020 at 17:40, Ravi Kanth <ravikanth....@gmail.com> wrote:
>
> Hi Community,
>
> I have a table (say A) that has 1000 small files of 100 MB each. I want to create another table (say B) from the same data in A, generating 256 MB files to match our HDFS block size.
>
> I am doing an *insert into B select * from A*, but this generates 3000 small files of 30MB each. One of the reasons is that the computation is happening on 20 daemons.
> Upon reading the documentation, it's suggested to set NUM_NODES=1 to stop distributed tasks from being submitted to multiple nodes. I have two more problems here:
>
> 1. This is not optimal, as one node is overwhelmed with the computation and can often run into scratch space limits at production loads.
> 2. This still doesn't give me 256 MB files.
>
> In the above context, my questions are:
>
> 1. Can Impala perform a reduce operation on the data from multiple nodes to write a single 256MB file?
> 2. Is there any other way I can generate these files of 256MB each?
>
> The current workaround is using Hive. It works perfectly, but I want to make use of Impala's capabilities in doing so.
>
> Impala Version: 2.12.0
>
> Thanks in advance,
> Ravi
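For completeness, the single-node workaround described above amounts to something like the sketch below (again, table and column names are placeholders). As noted in the thread, it funnels all the work through one node, so it scales poorly, can hit scratch space limits, and still does not guarantee files of the target size:

    -- Illustrative sketch of the NUM_NODES=1 workaround; placeholder names as before.
    SET NUM_NODES=1;                   -- run the whole statement on a single node
    SET PARQUET_FILE_SIZE=268435456;   -- 256 MB target (still only an upper bound)

    INSERT INTO table_b PARTITION (year, month)
    SELECT id, payload, year, month
    FROM table_a;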