the code that seems to flatMap directories into all the files inside them is in the private HadoopFsRelation.buildScan:
    // First assumes `input` is a directory path, and tries to get all files contained in it.
    fileStatusCache.leafDirToChildrenFiles.getOrElse(
      path,
      // Otherwise, `input` might be a file path
      fileStatusCache.leafFiles.get(path).toArray)

does anyone know why we want to list all the files, when all hadoop InputFormats can handle directories (and automatically pick up the files inside), and the recommended way of doing this in map-red is to pass in directories (to avoid the listing overhead and very large serialized JobConfs)?

On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers <ko...@tresata.com> wrote:
> i noticed in the comments for HadoopFsRelation.buildScan it says:
>
>   * @param inputFiles For a non-partitioned relation, it contains paths of
>   *        all data files in the relation. For a partitioned relation, it
>   *        contains paths of all data files in a single selected partition.
>
> do i understand it correctly that it actually lists all the data files
> (part files), not just the data directories that contain them?
> if so, that sounds like trouble to me, because most implementations will
> use this info to set the input paths for FileInputFormat. for example, in
> ParquetRelation:
>
>   if (inputFiles.nonEmpty) {
>     FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>   }
>
> if all the part files are listed, this will make the JobConf huge and could
> cause issues upon serialization and/or broadcasting.
>
> it can also lead to other inefficiencies; for example, spark-avro creates an
> RDD for every input (part) file, which quickly leads to thousands of RDDs.
>
> i think instead of files, only the directories should be listed in the
> input path?
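for anyone following along, here is a minimal, self-contained sketch of the lookup logic above. the `leafDirToChildrenFiles` and `leafFiles` maps here are hypothetical stand-ins (plain Scala maps with string paths) for the caches that Spark builds from HDFS listings; the point is only to show the try-directory-first, fall-back-to-file resolution, and why a directory path fans out into all its part files:

```scala
// Hypothetical, simplified model of the lookup in HadoopFsRelation.buildScan.
// In Spark the caches hold FileStatus objects built from filesystem listings;
// plain strings are used here just to illustrate the control flow.
object ListingSketch {
  // each leaf *directory* maps to every data file directly beneath it
  val leafDirToChildrenFiles: Map[String, Array[String]] = Map(
    "/data/table" -> Array("/data/table/part-00000", "/data/table/part-00001"))

  // individual leaf files are also tracked, for inputs that are file paths
  val leafFiles: Map[String, String] = Map(
    "/data/table/part-00000" -> "/data/table/part-00000",
    "/data/table/part-00001" -> "/data/table/part-00001")

  // first assume `path` is a directory; otherwise treat it as a single file
  def listLeafFiles(path: String): Array[String] =
    leafDirToChildrenFiles.getOrElse(path, leafFiles.get(path).toArray)
}
```

note how a single directory input expands into one entry per part file — which is exactly what then gets handed to FileInputFormat.setInputPaths and bloats the JobConf as the part-file count grows.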