The code below was introduced by SPARK-7673 / PR #6225. See item #1 in the description of the PR.
Cheers

On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuipers <ko...@tresata.com> wrote:

> the code that seems to flatMap directories to all the files inside is in
> the private HadoopFsRelation.buildScan:
>
>     // First assumes `input` is a directory path, and tries to get all
>     // files contained in it.
>     fileStatusCache.leafDirToChildrenFiles.getOrElse(
>       path,
>       // Otherwise, `input` might be a file path
>       fileStatusCache.leafFiles.get(path).toArray)
>
> does anyone know why we want to get all the files, when all hadoop
> inputformats can handle directories (and automatically pick up the files
> inside), and the recommended way of doing this in map-red is to pass in
> directories (to avoid the overhead of very large serialized jobconfs)?
>
> On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i noticed the comments for HadoopFsRelation.buildScan say:
>>
>>     * @param inputFiles For a non-partitioned relation, it contains paths
>>     *        of all data files in the relation. For a partitioned
>>     *        relation, it contains paths of all data files in a single
>>     *        selected partition.
>>
>> do i understand correctly that it actually lists all the data files
>> (part files), not just the data directories that contain them?
>> if so, that sounds like trouble to me, because most implementations will
>> use this info to set the input paths for FileInputFormat. for example, in
>> ParquetRelation:
>>
>>     if (inputFiles.nonEmpty) {
>>       FileInputFormat.setInputPaths(job, inputFiles.map(_.getPath): _*)
>>     }
>>
>> if all the part files are listed, this will make the jobConf huge and
>> could cause issues upon serialization and/or broadcasting.
>>
>> it can also lead to other inefficiencies; for example, spark-avro creates
>> an RDD for every input (part) file, which quickly leads to thousands of
>> RDDs.
>>
>> i think instead of files, only the directories should be listed in the
>> input path?
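
For anyone following along, here is a minimal sketch of the difference Koert is describing, using the Hadoop mapreduce API directly (the hdfs:///data/mytable path and the file count are made up for illustration): passing every part file to FileInputFormat.setInputPaths puts one entry per file into the JobConf, while passing the parent directory keeps it to a single entry and leaves the file enumeration to the InputFormat when it computes splits.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    val job = Job.getInstance()

    // Per-file: one entry per part file. All paths end up concatenated into
    // the single mapreduce.input.fileinputformat.inputdir property, so with
    // thousands of part files the serialized JobConf grows accordingly.
    // (hdfs:///data/mytable is a hypothetical example path.)
    val partFiles: Seq[Path] = (0 until 10000).map { i =>
      new Path(f"hdfs:///data/mytable/part-$i%05d")
    }
    FileInputFormat.setInputPaths(job, partFiles: _*)

    // Per-directory: a single entry. FileInputFormat lists the files inside
    // the directory itself when computing splits, keeping the JobConf small.
    FileInputFormat.setInputPaths(job, new Path("hdfs:///data/mytable"))

Note that setInputPaths replaces any previously set input paths (unlike addInputPath, which appends), so in the sketch above only the directory entry survives.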