I have a top-level directory in HDFS that contains nothing but subdirectories (no actual files). Each of those subdirs holds a combination of files and other subdirs:
    /topdir/dir1/(lots of files)
    /topdir/dir2/(lots of files)
    /topdir/dir2/subdir/(lots of files)

I noticed something strange:

    spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*", 32 * 8)
      .filter { case (fileName, contents) => fileName.endsWith(".xyz") }
      .map { case (fileName, contents) => 1 }
      .reduce(_ + _)

fails with an ArrayIndexOutOfBoundsException, but if I specify the path as

    spark.sparkContext.binaryFiles("hdfs://10.240.2.200/topdir/*/*", 32 * 8)

it works. I played around a little and found I could get the first attempt to work if I just put one regular file in /topdir.

This is with Spark 2.2.1. Is this known behavior?

--C
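P.S. In case it helps anyone hitting the same thing, a possible workaround sketch (untested here): binaryFiles is documented to accept a comma-separated list of input paths, so one glob per directory depth should pick up every file without ever handing the API a pattern that matches only directories. The two depths below assume the layout above; a deeper tree would need more entries.

    // Workaround sketch: enumerate each depth explicitly so the globs
    // match the regular files themselves, not just /topdir's directories.
    val count = spark.sparkContext
      .binaryFiles(
        "hdfs://10.240.2.200/topdir/*/*" +      // files directly under dir1, dir2, ...
          ",hdfs://10.240.2.200/topdir/*/*/*",  // files one level deeper, e.g. dir2/subdir
        32 * 8)
      .filter { case (fileName, _) => fileName.endsWith(".xyz") }
      .map(_ => 1)
      .reduce(_ + _)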