Hey Gabriel,

"The numSplits arg is a hint; you can return a different number of splits."

“Heads up #1. For mapred file input formats, the user specifies the number of 
splits. For mapreduce file input formats, the user can control the number of 
splits by specifying the lower and upper bounds on the size of splits.”

"Heads up #2. Most FileInputFormat implementations will give you 1 or more 
splits for each file in the input set. Hive will try to use a Combine input 
format, which combines small files/splits into larger splits."

In the end, I forced the maximum number of splits. Realizing that numSplits is 
more of a suggestion than anything else was an epiphany, and it seems to have 
fixed my mapper count problem. Unfortunately, Hive is bad at estimating the 
size of tables it doesn’t have correct stats for, which results in it always 
choosing a single reducer. Fortunately, I can run ANALYZE TABLE <table> 
COMPUTE STATISTICS to fix that problem.
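
For anyone who hits this later, here’s roughly the shape of what I ended up 
with. The class name, the "my.source.shards" property, and the split type are 
all made up for illustration; the point is just that getSplits may ignore the 
numSplits hint. (The split’s getLocations() also shows the empty-array answer 
I’m using for my question 4.)

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxSplitsInputFormat implements InputFormat<NullWritable, Text> {

        // Minimal hypothetical split carrying just a shard index.
        public static class ShardSplit implements InputSplit {
            private int shard;
            public ShardSplit() {}                       // needed for deserialization
            public ShardSplit(int shard) { this.shard = shard; }
            public long getLength() { return 0; }
            public String[] getLocations() { return new String[0]; } // no locality preference
            public void write(DataOutput out) throws IOException { out.writeInt(shard); }
            public void readFields(DataInput in) throws IOException { shard = in.readInt(); }
        }

        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            // numSplits is only a suggestion from the framework; return the real
            // maximum for the underlying source instead ("my.source.shards" is a
            // made-up property standing in for however you count your shards).
            int actualMax = job.getInt("my.source.shards", 1);
            InputSplit[] splits = new InputSplit[actualMax];
            for (int i = 0; i < actualMax; i++) {
                splits[i] = new ShardSplit(i);
            }
            return splits;
        }

        public RecordReader<NullWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            throw new UnsupportedOperationException("reader omitted from this sketch");
        }
    }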

Cheers,
Andrew



From: Gabriel Balan <[email protected]>
Organization: Oracle Corporation
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 6, 2016 at 3:08 PM
To: "[email protected]" <[email protected]>
Subject: Re: Implementing a custom StorageHandler


Hi
What is the difference between org.apache.hadoop.mapred.InputFormat and 
org.apache.hadoop.mapreduce.InputFormat?

There are two sets of APIs: the old (in the "mapred" package) and the new (in 
the "mapreduce" package).

The old API was deprecated, as the new one was meant to replace it. But then 
the old API got undeprecated, and now both are maintained.

When building a Hadoop job, pick all the bits and pieces (input format, 
mapper, reducer) from the same API.

When dealing with Hive, you want the mapred API.
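
To make the difference concrete, here is a minimal sketch of the two shapes 
(stub bodies only, with package-qualified names spelled out so the contrast 
is visible; assuming Hadoop 2.x, where the old InputFormat is an interface 
and the new one is an abstract class):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Old API: implement the interface in org.apache.hadoop.mapred.
    class OldApiFormat implements org.apache.hadoop.mapred.InputFormat<LongWritable, Text> {
        public org.apache.hadoop.mapred.InputSplit[] getSplits(
                org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException {
            return new org.apache.hadoop.mapred.InputSplit[0];   // stub
        }
        public org.apache.hadoop.mapred.RecordReader<LongWritable, Text> getRecordReader(
                org.apache.hadoop.mapred.InputSplit split,
                org.apache.hadoop.mapred.JobConf job,
                org.apache.hadoop.mapred.Reporter reporter) throws IOException {
            throw new UnsupportedOperationException("stub");
        }
    }

    // New API: extend the abstract class in org.apache.hadoop.mapreduce.
    class NewApiFormat extends org.apache.hadoop.mapreduce.InputFormat<LongWritable, Text> {
        public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(
                org.apache.hadoop.mapreduce.JobContext context) {
            return Collections.emptyList();                      // stub
        }
        public org.apache.hadoop.mapreduce.RecordReader<LongWritable, Text> createRecordReader(
                org.apache.hadoop.mapreduce.InputSplit split,
                org.apache.hadoop.mapreduce.TaskAttemptContext context) {
            throw new UnsupportedOperationException("stub");
        }
    }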
How is numSplits calculated in 
org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?
The numSplits arg is a hint; you can return a different number of splits.

The value of the numSplits arg comes from the conf, and you can set it with 
-D, -conf, or through JobConf.setNumMapTasks(int n) 
(https://hadoop.apache.org/docs/r2.6.2/api/src-html/org/apache/hadoop/mapred/JobConf.html#line.1335), 
whose javadoc reads:

    Set the number of map tasks for this job.

    Note: This is only a hint to the framework. The actual number of spawned 
    map tasks depends on the number of InputSplits 
    (https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputSplit.html) 
    generated by the job's InputFormat.getSplits(JobConf, int) 
    (https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html#getSplits%28org.apache.hadoop.mapred.JobConf,%20int%29). 
    A custom InputFormat 
    (https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html) 
    is typically used to accurately control the number of map tasks for the job.
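
For instance, either of these passes the hint through (setNumMapTasks just 
sets mapred.map.tasks under the hood; the jar and driver names below are 
placeholders):

    import org.apache.hadoop.mapred.JobConf;

    public class NumSplitsHint {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setNumMapTasks(16);   // equivalent to setting mapred.map.tasks=16
            // Or on the command line, for a Tool-based driver:
            //   hadoop jar myjob.jar MyDriver -D mapred.map.tasks=16 ...
            System.out.println("hint = " + conf.getNumMapTasks());
        }
    }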

Heads up #1. For mapred file input formats, the user specifies the number of 
splits. For mapreduce file input formats, the user can control the number of 
splits by specifying the lower and upper bounds on the size of splits.
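
To illustrate the mapreduce side of that, here is a minimal sketch (assuming 
Hadoop 2.x; the 64 MB and 256 MB bounds are arbitrary example values):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeBounds {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // With the new API you bound split sizes rather than asking for a count:
            FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);    // lower bound
            FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);   // upper bound
        }
    }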

Heads up #2. Most FileInputFormat implementations will give you 1 or more 
splits for each file in the input set. Hive will try to use a Combine input 
format, which combines small files/splits into larger splits.
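
For example, these are the knobs involved (property and class names as of 
Hive 1.x / Hadoop 2.x; in a Hive session you would normally use SET rather 
than Java, and the 256 MB cap is an arbitrary example):

    import org.apache.hadoop.mapred.JobConf;

    public class CombineSplits {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Hive's default input format, which merges small files/splits:
            conf.set("hive.input.format",
                     "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat");
            // Upper bound on the size of a combined split (old-API property name):
            conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);
        }
    }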

hth

Gabriel Balan
On 6/27/2016 6:59 PM, Long, Andrew wrote:
Hello everyone,

I’m in the process of implementing a custom StorageHandler and I had some 
questions.


1) What is the difference between org.apache.hadoop.mapred.InputFormat and 
org.apache.hadoop.mapreduce.InputFormat?

2) How is numSplits calculated in 
org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?

3) Is there a way to enforce a maximum number of splits? What would happen if 
I ignored numSplits and just returned an array with the actual maximum number 
of splits?

4) How is InputSplit.getLocations() used? If I’m accessing non-HDFS 
resources, what should I return? Currently I’m just returning an empty array.

Thanks for your time,
Andrew Long



--

The statements and opinions expressed here are my own and do not necessarily 
represent those of Oracle Corporation.
