OK, found my own answer via: https://www.ericlin.me/2016/03/hive-dynamic-insert-query-only-uses-1-reducer-out-of-thousands-of-reducers/
This setting gets rid of the last reduce phase in my insert:

set hive.optimize.sort.dynamic.partition=false;

Now I get as many files in my partition as I have reducers, and I can control the file size and the number of reducers with hive.exec.reducers.bytes.per.reducer.

I'm using Parquet with Snappy compression for my data, which roughly halves the file size, so I set the bytes per reducer to 2x my preferred file size. Hope this benefits anyone else.

On Mon, 24 Sep 2018 at 18:22, Patrick Duin <patd...@gmail.com> wrote:
> Hi all,
>
> I've got a query doing an insert overwrite like this:
>
> WITH tbl1 AS (
>   SELECT col0, col1, local_date, local_hour
>   FROM tbl1
>   WHERE ....
> ),
> tbl2 AS (
>   SELECT col0, col1, local_date, local_hour
>   FROM tbl2
>   WHERE ....
> )
> INSERT OVERWRITE TABLE tbl3
> PARTITION (local_date, local_hour)
> SELECT * FROM tbl1
> UNION DISTINCT
> SELECT * FROM tbl2
>
> Each partition contains ~15GB of compressed Parquet data. The input
> tables are Avro.
> I'm running on hive-2.3.2 using Tez on EMR in AWS.
> My problem is that the final reduce phase spins up with x reducers,
> where x is, say, 60. If my query has only 3 partitions to write, all
> reducers finish very quickly except the 3 that write the files.
> This makes the whole process very slow and also creates one very large
> part file in each output partition.
> I've been trying to control the size of the files and force multiple
> reducers to write to the same output partition.
>
> Settings I've been using:
> set hive.exec.dynamic.partition=false;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions.pernode=2000;
> set hive.exec.max.dynamic.partitions=20000;
> set mapred.reduce.task=9;
>
> Any ideas?
>
> Thanks,
> Patrick
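To summarize the fix described above as a settings sketch: disable the sort-dynamic-partition optimization so output is spread across reducers, then size the reducers so each output file lands near a target size. The 512MB target below is an illustrative assumption, not from the thread; the 2x factor follows the ~0.5 Snappy/Parquet compression ratio reported above.

```sql
-- Disable the optimization that funnels each dynamic partition
-- through a single sorting reducer (the "1 slow reducer" symptom).
SET hive.optimize.sort.dynamic.partition=false;

-- Aim for ~256MB Parquet files: Snappy compression roughly halves
-- the data, so give each reducer ~2x that in input bytes.
-- 536870912 bytes = 512MB (target size is an assumed example value).
SET hive.exec.reducers.bytes.per.reducer=536870912;
```

With these set, the number of reducers (and hence files per partition) scales with input volume instead of collapsing to one reducer per partition.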