Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

Stephen Boesch Tue, 03 Mar 2015 15:15:06 -0800

sc.textFile() invokes the Hadoop FileInputFormat via its subclass
TextInputFormat. Inside, the logic for recursive directory reading does
exist: it first detects whether an entry is a directory and, if so,
descends into it:


     // org.apache.hadoop.mapreduce.lib.input.FileInputFormat
     // (hadoop-core 0.20.2-737)
     for (FileStatus globStat : matches) {
       if (globStat.isDir()) {
         for (FileStatus stat : fs.listStatus(globStat.getPath(),
             inputFilter)) {
           result.add(stat);
         }
       } else {
         result.add(globStat);
       }
     }



However, invoking sc.textFile produces errors on directory entries:
"not a file". This behavior is confusing, given that the proper support
for handling directories appears to be in place.
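
One possible reading of the loop above is that it descends exactly one
level: children of a matched directory are added without being checked
again, so a nested directory stays in the result as a directory entry. The
sketch below imitates that shape with plain java.nio.file instead of the
Hadoop classes, so OneLevelListing, expandOneLevel and demo are hypothetical
names and not Hadoop or Spark API. Whether this is what actually triggers
"not a file" in sc.textFile is an assumption; it depends on which
FileInputFormat the Spark and Hadoop versions in use go through.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class OneLevelListing {

    // Same shape as the quoted loop: a matched directory is expanded one
    // level, a matched file is added as-is. Children are not inspected again.
    static List<Path> expandOneLevel(List<Path> matches) {
        List<Path> result = new ArrayList<>();
        for (Path match : matches) {
            if (Files.isDirectory(match)) {
                try (Stream<Path> children = Files.list(match)) {
                    children.sorted().forEach(result::add);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            } else {
                result.add(match);
            }
        }
        return result;
    }

    // Builds root/a.txt and root/nested/b.txt in a temp directory, expands
    // root, and reports what kind of entry each result is.
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("one-level-demo");
            Files.createFile(root.resolve("a.txt"));
            Path nested = Files.createDirectory(root.resolve("nested"));
            Files.createFile(nested.resolve("b.txt"));

            List<String> report = new ArrayList<>();
            for (Path p : expandOneLevel(Collections.singletonList(root))) {
                String kind = Files.isDirectory(p) ? "directory" : "file";
                report.add(root.relativize(p) + " -> " + kind);
            }
            return report;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch "nested" comes back as a directory and nested/b.txt is never
listed, so any later stage that expects only files would have to reject it
or recurse on its own.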


2015-03-03 15:04 GMT-08:00 Sean Owen <[email protected]>:

> This API reads a directory of files, not one file. A "file" here
> really means a directory full of part-* files. You do not need to read
> those separately.
>
> Any syntax that works with Hadoop's FileInputFormat should work. I
> thought you could specify a comma-separated list of paths? Maybe I am
> imagining that.
>
> On Tue, Mar 3, 2015 at 10:57 PM, S. Zhou <[email protected]> wrote:
> > Thanks Ted. Actually, a follow-up question: I need to read multiple HDFS
> > files into an RDD. What I am doing now is reading each file into its own
> > RDD and later unioning all these RDDs into one. I am not sure if it is
> > the best way to do it.
> >
> > Thanks
> > Senqiang
> >
> >
> > On Tuesday, March 3, 2015 2:40 PM, Ted Yu <[email protected]> wrote:
> >
> >
> > Looking at scaladoc:
> >
> >  /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
> >   def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
> >
> > Your conclusion is confirmed.
> >
> > On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou <[email protected]> wrote:
> >
> > I did some experiments and it seems not. But I'd like to get confirmation
> > (or perhaps I missed something). If it is supported, could you let me
> > know how to specify multiple folders? Thanks.
> >
> > Senqiang
> >
> >
> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
