Hi,
We are facing issues with Hive jobs when inserting into partitioned tables: Hive 
is creating a single very large file (hundreds of GB) per partition. On the older 
stack (Hive 0.11 on AWS EMR 2.4.2) the same jobs produced multiple files of 
roughly 200-300 MB each, but since upgrading to the new version (Hive 2.1.1, 
emr-5.3.1) each partition gets one file of hundreds of GB. We have tried several 
Hive settings and still cannot figure out how to split that single file into 
multiple smaller ones. For reference, here is a simplified sketch of the kind of 
INSERT we run (table and column names are illustrative):
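
INSERT OVERWRITE TABLE target_table PARTITION (dt)
SELECT col_a, col_b, dt
FROM source_table;

Below are the settings we have tried.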

1. Setting the number of reducers explicitly
set hive.exec.dynamic.partition=true;
set mapred.reduce.tasks=1000;
set hive.exec.max.dynamic.partitions.pernode=20000;
set hive.exec.max.dynamic.partitions=20000;
set hive.merge.mapredfiles=false;
set mapred.max.split.size=1048576;
set mapred.min.split.size=1048576;

2. Compression using gzip
set hive.exec.orc.default.compress=gzip;
set mapreduce.input.fileinputformat.split.maxsize=1048576;
set mapreduce.input.fileinputformat.split.minsize=1048576;
set hive.exec.compress.output=true;
set hive.mapjoin.smalltable.filesize=2500;
set mapred.output.compression.type=BLOCK;
set hive.merge.size.per.task=10;
set hive.merge.smallfiles.avgsize=10000;
set hive.merge.mapredfiles=false;
set hive.merge.mapfiles=true;

3. Compression using Snappy, trying to split into 100 MB files
set hive.exec.compress.output=true;
set mapred.max.split.size=100000000;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=100000000;
set hive.merge.smallfiles.avgsize=100000000;
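
One thing we noticed while re-reading these: all our scripts run with 
hive.execution.engine=tez (see the global settings further below), and our 
understanding (not verified) is that on Tez the mapred.* split-size and merge 
settings above may be ignored in favour of their Tez counterparts, along these 
lines:

set tez.grouping.min-size=134217728;  -- 128 MB; lower bound for Tez input split grouping
set tez.grouping.max-size=268435456;  -- 256 MB; upper bound for Tez input split grouping
set hive.merge.tezfiles=false;        -- Tez counterpart of hive.merge.mapredfiles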

None of the three attempts above worked as desired. We also found one more Hive 
setting, hive.hadoop.supports.splittable.combineinputformat, whose default value 
is false and which was added in Hive 0.6. However, trying it on emr-5.3.1 fails; 
the exact statement and error are shown below. Can you please let us know whether 
there is a corresponding setting for the newer Hadoop version, since this one 
appears to be deprecated in Hadoop 2.x?
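
set hive.hadoop.supports.splittable.combineinputformat=true;
-- Query returned non-zero code: 1, cause: hive configuration
-- hive.hadoop.supports.splittable.combineinputformat does not exist.
-- We suspect "set hive.conf.validation=false;" would let the statement through,
-- but that would only suppress the config check, not restore the old behaviour.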

Apart from these, we also have the following global settings, which we set for 
all scripts:

set hive.execution.engine=tez;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.metastore.client.socket.timeout=2h;
set hive.support.sql11.reserved.keywords=false;
set hive.msck.path.validation=ignore;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=20000;
set hive.exec.max.dynamic.partitions=20000;
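
Finally, one workaround we are considering but have not yet verified is to force 
multiple reducers per partition with DISTRIBUTE BY, so that each reducer writes 
its own file under the partition. A sketch, with illustrative table/column names 
and an arbitrary bucket count of 16:

-- Spread each partition's rows across ~16 reducers so each writes its own
-- file. Caveat: RAND() is non-deterministic, so retried reducer tasks could
-- misplace rows; hashing a stable column would be safer.
INSERT OVERWRITE TABLE target_table PARTITION (dt)
SELECT col_a, col_b, dt
FROM source_table
DISTRIBUTE BY dt, CAST(FLOOR(RAND() * 16) AS INT);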

Please have a look at the above settings and let us know how we can resolve this 
issue, or suggest any other workaround. Any light shed on this issue would be 
greatly appreciated.
Thanks,
Nandan
 
