OK, I found my own answer via:
https://www.ericlin.me/2016/03/hive-dynamic-insert-query-only-uses-1-reducer-out-of-thousands-of-reducers/

This setting gets rid of the last reduce phase in my insert:
set hive.optimize.sort.dynamic.partition=false;

Now I get as many files in each partition as I have reducers.
I can now control the file size and the number of reducers with:
hive.exec.reducers.bytes.per.reducer
I'm using Parquet with Snappy compression for my data, which roughly
halves the file size, so I set the bytes per reducer to about 2x my
target file size.
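
Putting it together, a rough sketch of the settings (assuming a target of
roughly 1GB per Parquet file; the 2x factor is only an estimate of the
Snappy compression ratio, not an exact value):

-- remove the sort-based dynamic partition optimization so the final
-- reduce phase goes away and multiple reducers can write per partition
set hive.optimize.sort.dynamic.partition=false;
-- aim each reducer at ~2GB of input so the Snappy-compressed Parquet
-- output lands around ~1GB per file (2 * 1024^3 bytes)
set hive.exec.reducers.bytes.per.reducer=2147483648;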

Hope this helps anyone else.



On Mon, 24 Sep 2018 at 18:22, Patrick Duin <patd...@gmail.com> wrote:

> Hi all,
>
> I got a query doing an insert overwrite like this:
>
> WITH tbl1 AS (
>   SELECT
>    col0, col1, local_date, local_hour
>   FROM tbl1
>   WHERE ....
>   ),
> tbl2 AS (
>   SELECT col0, col1, local_date, local_hour
>   FROM tbl2
>   WHERE ....
> )
> INSERT OVERWRITE TABLE
>   tbl3
> PARTITION (local_date, local_hour)
>   SELECT * FROM tbl1
>   UNION DISTINCT
>   SELECT * FROM tbl2
>
> Each partition contains ~15GB of compressed Parquet data. The input
> tables are Avro.
> I'm running on hive-2.3.2 using Tez on EMR in AWS.
> My problem is that the final reduce phase spins up with x reducers, where x
> is, let's say, 60. If my query has only 3 partitions to write, all reducers
> finish very quickly except the 3 that write the files.
> This makes the whole process very slow and also creates one very large
> part file in each output partition.
> I've been trying to control the size of the files and force multiple
> reducers writing to the same output partition.
>
> Settings I've been using:
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions.pernode=2000;
> set hive.exec.max.dynamic.partitions=20000;
> set mapred.reduce.tasks=9;
>
> any ideas?
>
> Thanks,
>  Patrick
>
