Hi
What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?
There are two sets of APIs: the old (in the "mapred" package) and the new (in the "mapreduce" package). The old API was deprecated, since the new one was meant to replace it, but it was later undeprecated, and now both are maintained. When building a Hadoop job, pick all the pieces (input format, mapper, reducer) from the same API. When dealing with Hive, you want the mapred API.
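For what it's worth, here is a minimal sketch of the old-API shape a Hive StorageHandler would point at; the class name and key/value types are placeholders, not anything from this thread:

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical skeleton of a custom old-API ("mapred") InputFormat.
public class MyStorageInputFormat implements InputFormat<NullWritable, Text> {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // numSplits is only a hint; return however many splits make sense
        // for the backing store.
        return new InputSplit[0];
    }

    @Override
    public RecordReader<NullWritable, Text> getRecordReader(InputSplit split,
            JobConf job, Reporter reporter) throws IOException {
        // Return a reader over the records covered by `split`.
        throw new UnsupportedOperationException("left out of this sketch");
    }
}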
How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?
The numSplits arg is only a hint; you can return a different number of splits. Its value comes from the job conf (mapred.map.tasks), which you can set with -D, -conf, or through JobConf.setNumMapTasks(int n) <https://hadoop.apache.org/docs/r2.6.2/api/src-html/org/apache/hadoop/mapred/JobConf.html#line.1335>:
Set the number of map tasks for this job. Note: This is only a hint to the framework. The actual number of spawned map tasks depends on the number of InputSplit <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputSplit.html>s generated by the job's InputFormat.getSplits(JobConf, int) <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html#getSplits%28org.apache.hadoop.mapred.JobConf,%20int%29>. A custom InputFormat <https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html> is typically used to accurately control the number of map tasks for the job.
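For example (a sketch, not from the thread; the class name and the value 16 are arbitrary), setting the hint programmatically:

import org.apache.hadoop.mapred.JobConf;

public class NumSplitsHintDemo {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Only a hint: this value becomes the numSplits argument the
        // framework passes to InputFormat.getSplits(JobConf, int).
        conf.setNumMapTasks(16);
        System.out.println("hint = " + conf.getNumMapTasks());
    }
}

From the command line, the equivalent (assuming the driver goes through ToolRunner/GenericOptionsParser) would be -D mapred.map.tasks=16.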
Heads up #1: for mapred file input formats, the user specifies the number of splits; for mapreduce file input formats, the user controls the number of splits by setting lower and upper bounds on the split size (see the sketch below).

Heads up #2: most FileInputFormat implementations will give you one or more splits for each file in the input set. Hive will try to use a Combine input format, which combines small files/splits into larger splits.
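A sketch of the split-size knobs on the mapreduce side (the class name, input path, and sizes are just placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeBoundsDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/input"));

        // The number of splits follows from the total input size and these
        // bounds, not from a direct split count.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
    }
}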
hth

Gabriel Balan

On 6/27/2016 6:59 PM, Long, Andrew wrote:

Hello everyone,

I'm in the process of implementing a custom StorageHandler and I had some questions.

1) What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?

2) How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?

3) Is there a way to enforce a maximum number of splits? What would happen if I ignored numSplits and just returned an array of splits with the actual maximum number of splits?

4) How is InputSplit.getLocations() used? If I'm accessing non-HDFS resources, what should I return? Currently I'm just returning an empty array.

Thanks for your time,

Andrew Long
-- The statements and opinions expressed here are my own and do not necessarily represent those of Oracle Corporation.
