Hey Gabriel,

“The numSplits arg is a hint, you can return a different number of splits.”
“Heads up #1. For mapred file input formats, the user specifies the number of splits. For mapreduce file input formats, the user can control the number of splits by specifying the lower and upper bounds on the size of splits.”

“Heads up #2. Most FileInputFormat implementations will give you one or more splits for each file in the input set. Hive will try to use a Combine input format, which combines small files/splits into larger splits.”

In the end, I ended up forcing the maximum number of splits. Realizing that numSplits is more of a suggestion than anything else was an epiphany. This seems to have fixed my mapper count problem.

Unfortunately, Hive is bad at guessing the size of tables it doesn’t have correct stats for, which results in it always choosing a single reducer. Fortunately, I can run ANALYZE table COMPUTE STATISTICS to fix that problem.

Cheers
Andrew

From: Gabriel Balan <[email protected]>
Organization: Oracle Corporation
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 6, 2016 at 3:08 PM
To: "[email protected]" <[email protected]>
Subject: Re: Implementing a custom StorageHandler

Hi

What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?

There are two sets of APIs: the old (in the "mapred" package) and the new (in the "mapreduce" package). The old was deprecated, as the new was meant to replace it. But then the old API got undeprecated, and now they're both maintained. When building a Hadoop job, pick all the bits and pieces (input format, mapper, reducer) from the same API. When dealing with Hive, you want the mapred API.

How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?

The numSplits arg is a hint; you can return a different number of splits.
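[Editor's note: the "hint, not contract" behavior described above can be illustrated with the arithmetic a typical FileInputFormat-style getSplits performs. This is a minimal, self-contained sketch in plain Java with hypothetical names, not the actual Hadoop implementation: the requested numSplits only influences the target split size, and the real split count falls out of the file length and the minimum split size.]

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of how a FileInputFormat-style getSplits treats numSplits as a hint. */
public class SplitPlanner {

    /** A (offset, length) pair standing in for a Hadoop FileSplit. */
    public record Range(long offset, long length) {}

    /**
     * numSplits only sets a *target* split size; the actual number of splits
     * is however many target-sized chunks the file length divides into, so
     * it can easily differ from the hint.
     */
    public static List<Range> plan(long fileLength, int numSplits, long minSplitSize) {
        long target = Math.max(minSplitSize,
                (long) Math.ceil((double) fileLength / numSplits));
        List<Range> splits = new ArrayList<>();
        for (long off = 0; off < fileLength; off += target) {
            splits.add(new Range(off, Math.min(target, fileLength - off)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Ask for 10 splits of a 100-byte file with a 64-byte minimum split
        // size: the minimum wins, and we get 2 splits, not 10.
        System.out.println(plan(100, 10, 64).size()); // prints 2
    }
}
```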
The value of the numSplits arg comes from the conf, and you can set it with -D, -conf, or through JobConf.setNumMapTasks(int n)<https://hadoop.apache.org/docs/r2.6.2/api/src-html/org/apache/hadoop/mapred/JobConf.html#line.1335>:

"Set the number of map tasks for this job. Note: This is only a hint to the framework. The actual number of spawned map tasks depends on the number of InputSplit<https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputSplit.html>s generated by the job's InputFormat.getSplits(JobConf, int)<https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html#getSplits%28org.apache.hadoop.mapred.JobConf,%20int%29>. A custom InputFormat<https://hadoop.apache.org/docs/r2.6.2/api/org/apache/hadoop/mapred/InputFormat.html> is typically used to accurately control the number of map tasks for the job."

Heads up #1. For mapred file input formats, the user specifies the number of splits. For mapreduce file input formats, the user can control the number of splits by specifying the lower and upper bounds on the size of splits.

Heads up #2. Most FileInputFormat implementations will give you one or more splits for each file in the input set. Hive will try to use a Combine input format, which combines small files/splits into larger splits.

hth
Gabriel Balan

On 6/27/2016 6:59 PM, Long, Andrew wrote:

Hello everyone,

I’m in the process of implementing a custom StorageHandler and I had some questions.

1) What is the difference between org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat?
2) How is numSplits calculated in org.apache.hadoop.mapred.InputFormat.getSplits(JobConf job, int numSplits)?
3) Is there a way to enforce a maximum number of splits? What would happen if I ignored numSplits and just returned an array of splits that was the actual maximum number of splits?
4) How is InputSplit.getLocations() used? If I’m accessing non-HDFS resources, what should I return?
Currently I’m just returning an empty array.

Thanks for your time,
Andrew Long

--
The statements and opinions expressed here are my own and do not necessarily represent those of Oracle Corporation.
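[Editor's note: the small-file combining behavior Gabriel describes in "Heads up #2" amounts to a greedy bin-fill over file sizes. The following is a hypothetical, Hadoop-free sketch of that combining step in plain Java; all names are invented for illustration and this is not the actual CombineFileInputFormat code, which also groups by node and rack locality.]

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the small-file combining a Combine-style input format performs. */
public class SplitCombiner {

    /**
     * Greedily pack per-file sizes into combined splits no larger than
     * maxCombinedSize. A single file bigger than the cap still becomes
     * its own split. Returns the total size of each combined split.
     */
    public static List<Long> combine(List<Long> fileSizes, long maxCombinedSize) {
        List<Long> combined = new ArrayList<>();
        long current = 0;
        for (long size : fileSizes) {
            if (current > 0 && current + size > maxCombinedSize) {
                combined.add(current); // close out the current combined split
                current = 0;
            }
            current += size;
        }
        if (current > 0) combined.add(current);
        return combined;
    }

    public static void main(String[] args) {
        // Five small files packed under a 100-byte cap yield 2 combined
        // splits (sizes 90 and 60) instead of 5 one-per-file splits.
        System.out.println(combine(List.of(40L, 30L, 20L, 50L, 10L), 100));
    }
}
```

This is why a job over many tiny files can end up with far fewer mappers than files when Hive picks its Combine input format.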
