[ https://issues.apache.org/jira/browse/HIVE-74?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan reassigned HIVE-74:
------------------------------------

    Assignee: Rajesh Balamohan  (was: Namit Jain)

> Hive can use CombineFileInputFormat for when the input are many small files
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-74
>                 URL: https://issues.apache.org/jira/browse/HIVE-74
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: dhruba borthakur
>            Assignee: Rajesh Balamohan
>             Fix For: 0.5.0
>
>         Attachments: hive.74.1.patch, hive.74.2.patch,
>                      hiveCombineSplit.patch, hiveCombineSplit.patch,
>                      hiveCombineSplit2.patch
>
>
> There are cases when the input to a Hive job consists of thousands of small
> files. In this case, there is a mapper for each file. Most of the overhead of
> spawning all these mappers can be avoided if Hive uses CombineFileInputFormat,
> introduced via HADOOP-4565.
>
> Options to control this behavior:
> {code}
> hive.input.format (org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
>   (default, if empty), or org.apache.hadoop.hive.ql.io.HiveInputFormat)
> mapred.min.split.size.per.node (the minimum bytes of data to create a
>   node-local partition; otherwise the data will be combined at rack level.
>   default: 0)
> mapred.min.split.size.per.rack (the minimum bytes of data to create a
>   rack-local partition; otherwise the data will be combined at global level.
>   default: 0)
> mapred.max.split.size (the max size of each split; it can be exceeded because
>   we stop accumulating *after* reaching it, instead of before)
> {code}
> The 3 numbers above must be in non-descending order.
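
To illustrate the ordering constraint on the three size options and the "stop accumulating *after* reaching it" behavior of mapred.max.split.size, here is a minimal Python sketch. It is not the Hadoop or Hive implementation: the function names check_split_sizes and combine_files are hypothetical, node and rack locality are ignored, and the maximum split size is assumed to be positive.

```python
def check_split_sizes(per_node, per_rack, max_size):
    """Return True if the three sizes are in non-descending order.

    Hypothetical helper: mirrors the rule that
    mapred.min.split.size.per.node <= mapred.min.split.size.per.rack
    <= mapred.max.split.size.
    """
    return per_node <= per_rack <= max_size


def combine_files(file_sizes, max_split_size):
    """Greedily pack file sizes (in bytes) into combined splits.

    Simplified assumption: a split is closed only *after* its running
    total reaches max_split_size, so a split can exceed the limit.
    Locality (node/rack) is deliberately ignored in this sketch.
    """
    if max_split_size <= 0:
        raise ValueError("this sketch assumes a positive max split size")
    splits, current, total = [], [], 0
    for size in file_sizes:
        current.append(size)
        total += size
        if total >= max_split_size:
            # Limit reached (possibly overshot): close this split.
            splits.append(current)
            current, total = [], 0
    if current:
        # Leftover files that never reached the limit form a final split.
        splits.append(current)
    return splits


if __name__ == "__main__":
    print(check_split_sizes(0, 0, 100))
    print(combine_files([40, 40, 40, 40, 40], 100))
```

With five 40-byte files and a 100-byte limit, the first split in this sketch holds three files (120 bytes), which shows how the limit is overshot rather than strictly enforced.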