the code that seems to flatMap directories into all the files inside them is in the private HadoopFsRelation.buildScan:
    // First assumes `input` is a directory path, and tries to get all files contained in it.
    fileStatusCache.leafDirToChildrenFiles.getOrElse(
      path,
      // Otherwise, `input` might be a file path
      fileStatusCache.leafFiles.get(path).toArray)

does anyone know why we want to list all the files, when all hadoop InputFormats can handle directories (and automatically pick up the files inside), and the recommended way of doing this in map-red is to pass in directories (to avoid the listing overhead and very large serialized JobConfs)?

On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers <ko...@tresata.com> wrote:
> i noticed in the comments for HadoopFsRelation.buildScan it says:
>
>   * @param inputFiles For a non-partitioned relation, it contains paths of
>   *        all data files in the relation. For a partitioned relation, it
>   *        contains paths of all data files in a single selected partition.
>
> do i understand it correctly that it actually lists all the data files
> (part files), not just the data directories that contain them?
> if so, that sounds like trouble to me, because most implementations will
> use this info to set the input paths for FileInputFormat. for example, in
> ParquetRelation:
>
>   if (inputFiles.nonEmpty) {
>     FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>   }
>
> if all the part files are listed, this will make the JobConf huge and could
> cause issues upon serialization and/or broadcasting.
>
> it can also lead to other inefficiencies; for example, spark-avro creates an
> RDD for every input (part) file, which quickly leads to thousands of RDDs.
>
> i think instead of files, only the directories should be listed in the
> input path?
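for anyone following along, here is a minimal, self-contained sketch of the lookup logic above. the `leafDirToChildrenFiles` and `leafFiles` maps here are hypothetical stand-ins (plain Scala maps with string paths) for the caches that Spark builds from HDFS listings; the point is only to show the try-directory-first, fall-back-to-file resolution, and why a directory path fans out into all its part files:

```scala
// Hypothetical, simplified model of the lookup in HadoopFsRelation.buildScan.
// In Spark the caches hold FileStatus objects built from filesystem listings;
// plain strings are used here just to illustrate the control flow.
object ListingSketch {
  // each leaf *directory* maps to every data file directly beneath it
  val leafDirToChildrenFiles: Map[String, Array[String]] = Map(
    "/data/table" -> Array("/data/table/part-00000", "/data/table/part-00001"))

  // individual leaf files are also tracked, for inputs that are file paths
  val leafFiles: Map[String, String] = Map(
    "/data/table/part-00000" -> "/data/table/part-00000",
    "/data/table/part-00001" -> "/data/table/part-00001")

  // first assume `path` is a directory; otherwise treat it as a single file
  def listLeafFiles(path: String): Array[String] =
    leafDirToChildrenFiles.getOrElse(path, leafFiles.get(path).toArray)
}
```

note how a single directory input expands into one entry per part file — which is exactly what then gets handed to FileInputFormat.setInputPaths and bloats the JobConf as the part-file count grows.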