I have a top-level directory in HDFS that contains nothing but subdirectories (no actual files). Each of those subdirs holds a combination of files and other subdirs:
    /topdir/dir1/(lots of files)
    /topdir/dir2/(lots of files)
    /topdir/dir2/subdir/(lots of files)

I noticed something strange:

    spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*", 32 * 8)
      .filter { case (fileName, contents) => fileName.endsWith(".xyz") }
      .map { case (fileName, contents) => 1 }
      .reduce(_ + _)

fails with an ArrayIndexOutOfBoundsException, but if I specify the path as

    spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*/*", 32 * 8)

it works. I played around a little and found I could get the first attempt to work if I just put one regular file in /topdir.

This is with Spark 2.2.1. Is this known behavior?

--C
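P.S. In case it helps anyone hitting the same thing, a possible workaround sketch (untested here): binaryFiles is documented to accept a comma-separated list of input paths, so one glob per directory depth should pick up every file without ever handing the API a pattern that matches only directories. The two depths below assume the layout above; a deeper tree would need more entries.

    // Workaround sketch: enumerate each depth explicitly so the globs
    // match the regular files themselves, not just /topdir's directories.
    val count = spark.sparkContext
      .binaryFiles(
        "hdfs://10.240.2.200/topdir/*/*" +      // files directly under dir1, dir2, ...
          ",hdfs://10.240.2.200/topdir/*/*/*",  // files one level deeper, e.g. dir2/subdir
        32 * 8)
      .filter { case (fileName, _) => fileName.endsWith(".xyz") }
      .map(_ => 1)
      .reduce(_ + _)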