Hi
What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?
There are two sets of APIs: the old (in the "mapred" package) and the new (in the "mapreduce" package). The old API was deprecated, since the new one was meant to replace it, but it was later undeprecated, and now both are maintained. When building a Hadoop job, pick all the pieces (input format, mapper, reducer) from the same API. When dealing with Hive, you want the mapred API.
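For what it's worth, here is a minimal sketch of the old-API shape a Hive StorageHandler would point at; the class name and key/value types are placeholders, not anything from this thread:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical skeleton of a custom old-API ("mapred") InputFormat.
public class MyStorageInputFormat implements InputFormat<NullWritable, Text> {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // numSplits is only a hint; return however many splits make sense
        // for the backing store.
        return new InputSplit[0];
    }

    @Override
    public RecordReader<NullWritable, Text> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        // Return a reader over the records covered by `split`.
        throw new UnsupportedOperationException("left out of this sketch");
    }
}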
How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?
The numSplits arg is only a hint; you can return a different number of splits. Its value comes from the job conf (mapred.map.tasks), which you can set with -D, -conf, or through JobConf.setNumMapTasks(int n) <https://hadoop.apache.org/docs/r2.6.2/api/src-html/org/apache/hadoop/mapred/JobConf.html#line.1335>:
Set the number of map tasks for this job. Note: This is only a hint to the framework. The actual number of spawned map tasks depends on the number of InputSplit <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputSplit.html>s generated by the job's InputFormat.getSplits(JobConf, int) <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html#getSplits%28org.apache.hadoop.mapred.JobConf,%20int%29>. A custom InputFormat <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html> is typically used to accurately control the number of map tasks for the job.
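For example (a sketch, not from the thread; the class name and the value 16 are arbitrary), setting the hint programmatically:

import org.apache.hadoop.mapred.JobConf;

public class NumSplitsHintDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Only a hint: this value becomes the numSplits argument the
        // framework passes to InputFormat.getSplits(JobConf, int).
        conf.setNumMapTasks(16);
        System.out.println("hint = " + conf.getNumMapTasks());
    }
}

From the command line, the equivalent (assuming the driver goes through ToolRunner/GenericOptionsParser) would be -D mapred.map.tasks=16.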
Heads up #1: for mapred file input formats, the user specifies the number of splits; for mapreduce file input formats, the user controls the number of splits by setting lower and upper bounds on the split size (see the sketch below).

Heads up #2: most FileInputFormat implementations will give you one or more splits for each file in the input set. Hive will try to use a Combine input format, which combines small files/splits into larger splits.
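A sketch of the split-size knobs on the mapreduce side (the class name, input path, and sizes are just placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeBoundsDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/input"));

        // The number of splits follows from the total input size and these
        // bounds, not from a direct split count.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
    }
}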
hth

Gabriel Balan

On 6/27/2016 6:59 PM, Long, Andrew wrote:

Hello everyone,

I'm in the process of implementing a custom StorageHandler and I had some questions.

1) What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?

2) How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?

3) Is there a way to enforce a maximum number of splits? What would happen if I ignored numSplits and just returned an array of splits with the actual maximum number of splits?

4) How is InputSplit.getLocations() used? If I'm accessing non-HDFS resources, what should I return? Currently I'm just returning an empty array.

Thanks for your time,

Andrew Long
-- The statements and opinions expressed here are my own and do not necessarily represent those of Oracle Corporation.
