Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

Ted Yu Tue, 03 Mar 2015 15:58:03 -0800

Thanks for the confirmation, Stephen.

On Tue, Mar 3, 2015 at 3:53 PM, Stephen Boesch <[email protected]> wrote:


> Thanks, I was looking at an old version of FileInputFormat.
>
> BEFORE setting the recursive config
> (mapreduce.input.fileinputformat.input.dir.recursive):
>
> scala> sc.textFile("dev/*").count
> java.io.IOException: Not a file: file:/shared/sparkup/dev/audit-release/blank_maven_build
>
> The property is unset by default (null), which is treated as "false":
>
> scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
> res1: String = null
>
>
> AFTER:
>
> Now set the value:
>
> scala> sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
>
> scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
> res4: String = true
>
> scala> sc.textFile("dev/*").count
> ..
> res5: Long = 3481
>
>
> So it works.
>
> 2015-03-03 15:26 GMT-08:00 Ted Yu <[email protected]>:
>
>> Looking at FileInputFormat#listStatus():
>>
>>     // Whether we need to recursive look into the directory structure
>>     boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);
>>
>> where:
>>
>>   public static final String INPUT_DIR_RECURSIVE =
>>     "mapreduce.input.fileinputformat.input.dir.recursive";
>>
>> FYI
>>
>> On Tue, Mar 3, 2015 at 3:14 PM, Stephen Boesch <[email protected]> wrote:
>>
>>>
>>> sc.textFile() invokes Hadoop's FileInputFormat via its subclass
>>> TextInputFormat. The logic for reading directories does exist inside it:
>>> it first checks whether an entry is a directory and, if so, descends
>>> into it:
>>>
>>>     for (FileStatus globStat: matches) {
>>>       if (globStat.isDir()) {
>>>         for (FileStatus stat: fs.listStatus(globStat.getPath(),
>>>             inputFilter)) {
>>>           result.add(stat);
>>>         }
>>>       } else {
>>>         result.add(globStat);
>>>       }
>>>     }
>>>
>>> Source:
>>> <http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#218>
>>>
>>>
>>>
>>> However, invoking sc.textFile fails on directory entries with
>>> "Not a file". This is confusing, given that support for handling
>>> directories appears to be in place.
>>>
>>>
>>> 2015-03-03 15:04 GMT-08:00 Sean Owen <[email protected]>:
>>>
>>>> This API reads a directory of files, not one file. A "file" here
>>>> really means a directory full of part-* files. You do not need to read
>>>> those separately.
>>>>
>>>> Any syntax that works with Hadoop's FileInputFormat should work. I
>>>> thought you could specify a comma-separated list of paths? Maybe I am
>>>> imagining that.
>>>>
>>>> On Tue, Mar 3, 2015 at 10:57 PM, S. Zhou <[email protected]> wrote:
>>>> > Thanks Ted. Actually, a follow-up question: I need to read multiple HDFS
>>>> > files into an RDD. What I am doing now is reading each file into its own
>>>> > RDD and later taking the union of all these RDDs. I am not sure
>>>> > if that is the best way to do it.
>>>> >
>>>> > Thanks
>>>> > Senqiang
>>>> >
>>>> >
>>>> > On Tuesday, March 3, 2015 2:40 PM, Ted Yu <[email protected]> wrote:
>>>> >
>>>> >
>>>> > Looking at scaladoc:
>>>> >
>>>> >   /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
>>>> >   def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
>>>> >
>>>> > Your conclusion is confirmed.
>>>>
>>>> >
>>>> > On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou <[email protected]> wrote:
>>>> >
>>>> > I did some experiments and it seems not, but I would like to get
>>>> > confirmation (or perhaps I missed something). If it is supported, could
>>>> > you let me know how to specify multiple folders? Thanks.
>>>> >
>>>> > Senqiang
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>
>>
>
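
For readers following the thread: the difference between the one-level
listing quoted above and the recursive behaviour can be sketched in plain
Java. This is only an analogy built on java.nio.file, not Hadoop's
FileInputFormat code. The class name InputListingSketch and the methods
listInputs and requireFiles are hypothetical, and the directory names simply
reuse the ones from the error message above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class InputListingSketch {

    // Expands each matched directory. With recursive=false only one level is
    // expanded (as in the quoted loop), so nested directories end up in the
    // result as plain entries. With recursive=true they are descended into.
    static List<Path> listInputs(List<Path> matches, boolean recursive) throws IOException {
        List<Path> result = new ArrayList<>();
        for (Path match : matches) {
            if (Files.isDirectory(match)) {
                try (DirectoryStream<Path> children = Files.newDirectoryStream(match)) {
                    for (Path child : children) {
                        if (recursive && Files.isDirectory(child)) {
                            result.addAll(listInputs(Collections.singletonList(child), true));
                        } else {
                            result.add(child);
                        }
                    }
                }
            } else {
                result.add(match);
            }
        }
        return result;
    }

    // Hypothetical stand-in for the later check that rejects directory entries.
    static void requireFiles(List<Path> inputs) throws IOException {
        for (Path p : inputs) {
            if (Files.isDirectory(p)) {
                throw new IOException("Not a file: " + p);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small tree: dev/audit-release/blank_maven_build/pom.xml and dev/run-tests
        Path root = Files.createTempDirectory("dev");
        Path nested = Files.createDirectories(root.resolve("audit-release").resolve("blank_maven_build"));
        Files.write(nested.resolve("pom.xml"), "<project/>".getBytes(StandardCharsets.UTF_8));
        Files.write(root.resolve("run-tests"), "#!/bin/sh".getBytes(StandardCharsets.UTF_8));

        // The direct children of the root play the role of the glob "dev/*".
        List<Path> matches = new ArrayList<>();
        try (DirectoryStream<Path> top = Files.newDirectoryStream(root)) {
            for (Path p : top) {
                matches.add(p);
            }
        }

        try {
            requireFiles(listInputs(matches, false));
        } catch (IOException e) {
            System.out.println("recursive=false -> " + e.getMessage());
        }

        List<Path> all = listInputs(matches, true);
        requireFiles(all);
        System.out.println("recursive=true -> " + all.size() + " files");
    }
}
```

In Spark itself the switch is the
mapreduce.input.fileinputformat.input.dir.recursive property set on
sc.hadoopConfiguration, as shown in the session quoted above.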
