The code below was introduced by SPARK-7673 / PR #6225. See item #1 in the description of the PR.
Cheers

On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers <ko...@tresata.com> wrote:

> the code that seems to flatMap directories to all the files inside is in
> the private HadoopFsRelation.buildScan:
>
>     // First assumes `input` is a directory path, and tries to get all
>     // files contained in it.
>     fileStatusCache.leafDirToChildrenFiles.getOrElse(
>       path,
>       // Otherwise, `input` might be a file path
>       fileStatusCache.leafFiles.get(path).toArray)
>
> does anyone know why we want to get all the files, when all hadoop
> inputformats can handle directories (and automatically pick up the files
> inside), and the recommended way of doing this in map-red is to pass in
> directories (to avoid the overhead of very large serialized jobconfs)?
>
> On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i noticed the comments for HadoopFsRelation.buildScan say:
>>
>>     * @param inputFiles For a non-partitioned relation, it contains paths
>>     *        of all data files in the relation. For a partitioned
>>     *        relation, it contains paths of all data files in a single
>>     *        selected partition.
>>
>> do i understand correctly that it actually lists all the data files
>> (part files), not just the data directories that contain them?
>> if so, that sounds like trouble to me, because most implementations will
>> use this info to set the input paths for FileInputFormat. for example, in
>> ParquetRelation:
>>
>>     if (inputFiles.nonEmpty) {
>>       FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>>     }
>>
>> if all the part files are listed, this will make the jobConf huge and
>> could cause issues upon serialization and/or broadcasting.
>>
>> it can also lead to other inefficiencies; for example, spark-avro creates
>> an RDD for every input (part) file, which quickly leads to thousands of
>> RDDs.
>>
>> i think instead of files, only the directories should be listed in the
>> input path?
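
For anyone following along, here is a minimal sketch of the difference Koert is describing, using the Hadoop mapreduce API directly (the hdfs:///data/mytable path and the file count are made up for illustration): passing every part file to FileInputFormat.setInputPaths puts one entry per file into the JobConf, while passing the parent directory keeps it to a single entry and leaves the file enumeration to the InputFormat when it computes splits.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    val job = Job.getInstance()

    // Per-file: one entry per part file. All paths end up concatenated into
    // the single mapreduce.input.fileinputformat.inputdir property, so with
    // thousands of part files the serialized JobConf grows accordingly.
    // (hdfs:///data/mytable is a hypothetical example path.)
    val partFiles: Seq[Path] = (0 until 10000).map { i =>
      new Path(f"hdfs:///data/mytable/part-$i%05d")
    }
    FileInputFormat.setInputPaths(job, partFiles: _*)

    // Per-directory: a single entry. FileInputFormat lists the files inside
    // the directory itself when computing splits, keeping the JobConf small.
    FileInputFormat.setInputPaths(job, new Path("hdfs:///data/mytable"))

Note that setInputPaths replaces any previously set input paths (unlike addInputPath, which appends), so in the sketch above only the directory entry survives.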