Hi,

We are facing an issue with Hive jobs when inserting into partitioned tables: Hive creates one very large file (hundreds of GB) per partition. On the older stack (Hive 0.11 on AWS EMR 2.4.2) the same jobs produced multiple files of roughly 200-300 MB each, but after upgrading to the new version (Hive 2.1.1, emr-5.3.1) we get a single file of hundreds of GB inside each partition. We have tried the Hive settings below and still cannot figure out how to split the single file into multiple smaller files. These are the settings we have tried:
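For context, here is a minimal sketch of the kind of dynamic-partition insert we run; the table and column names are placeholders, not our real schema:

```sql
-- Placeholder schema for illustration only; our real tables differ.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Each distinct value of dt becomes a partition. On the new cluster, every
-- partition ends up as one file of 100s of GB instead of many ~200-300 MB files.
INSERT OVERWRITE TABLE target_table PARTITION (dt)
SELECT col_a, col_b, dt
FROM source_table;
```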
1. Explicitly setting the number of reducers:

set hive.exec.dynamic.partition=true;
set mapred.reduce.tasks=1000;
set hive.exec.max.dynamic.partitions.pernode=20000;
set hive.exec.max.dynamic.partitions=20000;
set hive.merge.mapredfiles=false;
set mapred.max.split.size=1048576;
set mapred.min.split.size=1048576;

2. Compression using gzip:

set hive.exec.orc.default.compress=gzip;
set mapreduce.input.fileinputformat.split.maxsize=1048576;
set mapreduce.input.fileinputformat.split.minsize=1048576;
set hive.exec.compress.output=true;
set hive.mapjoin.smalltable.filesize=2500;
set mapred.output.compression.type=BLOCK;
set hive.merge.size.per.task=10;
set hive.merge.smallfiles.avgsize=10000;
set hive.merge.mapredfiles=false;
set hive.merge.mapfiles=true;

3. Compression using Snappy, trying to split into 100 MB files:

set hive.exec.compress.output=true;
set mapred.max.split.size=100000000;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=100000000;
set hive.merge.smallfiles.avgsize=100000000;

We have tried all of these options and none worked as desired. We also found one more Hive setting, hive.hadoop.supports.splittable.combineinputformat, whose default value is false and which was added in Hive 0.6. However, when we tried it on emr-5.3.1 we got an error: "Query returned non-zero code: 1, cause: hive configuration hive.hadoop.supports.splittable.combineinputformat does not exist." Can you please let us know whether there is a corresponding setting in the newer Hadoop version, since the above one appears to be deprecated in Hadoop 2.x?

Apart from these, we have the following global settings, which we apply to all scripts:
set hive.execution.engine=tez;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.metastore.client.socket.timeout=2h;
set hive.support.sql11.reserved.keywords=false;
set hive.msck.path.validation=ignore;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=20000;
set hive.exec.max.dynamic.partitions=20000;

Please have a look at the settings above and let us know how we can resolve this issue, or suggest any other workaround. Any light you can shed on this will be greatly appreciated.

Thanks,
Nandan