I've set "mapreduce.input.fileinputformat.input.dir.recursive" to "true" in the SparkConf I use to instantiate the SparkContext, and I've confirmed at runtime that my Scala job sees this property by printing it out, but sparkContext.textFile("/foo/*/bar/*.gz") still fails (as do /foo/**/bar/*.gz and /foo/*/*/bar/*.gz).
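For concreteness, a minimal sketch of the setup described above (app name and paths are placeholders, not the actual job). Two details worth checking: Hadoop properties set on a SparkConf are only copied into the Hadoop Configuration when prefixed with "spark.hadoop.", and Hadoop's glob syntax has no recursive "**" — each "*" matches exactly one path segment.

  import org.apache.spark.{SparkConf, SparkContext}

  // A bare Hadoop key on SparkConf is silently ignored; it only propagates
  // to the Hadoop Configuration with the "spark.hadoop." prefix.
  val conf = new SparkConf()
    .setAppName("recursive-glob")
    .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
  val sc = new SparkContext(conf)

  // Equivalent, and easier to verify at runtime:
  sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

  // The glob is expanded by FileSystem.globStatus, which has no "**";
  // the recursive flag only makes FileInputFormat descend into
  // subdirectories of the paths the glob actually matches.
  val rdd = sc.textFile("/foo/*/*/bar/*.gz")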
Any thoughts or workarounds? I'm considering using bash globbing to match the files recursively and feeding hundreds of thousands of arguments to spark-submit. Reasons for or against?

From: Ted Yu <yuzhih...@gmail.com>
Date: Wednesday, December 9, 2015 at 3:50 PM
To: James Ding <jd...@palantir.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Recursive nested wildcard directory walking in Spark

Have you seen this thread?
http://search-hadoop.com/m/q3RTt2uhMX1UhnCc1&subj=Re+Does+sc+newAPIHadoopFile+support+multiple+directories+or+nested+directories+

FYI

On Wed, Dec 9, 2015 at 11:18 AM, James Ding <jd...@palantir.com> wrote:

> Hi!
>
> My name is James, and I'm working on a question that doesn't seem to have many answers online. I was hoping Spark/Hadoop gurus could shed some light on this.
>
> I have a data feed on NFS whose files sit at paths like /foo/*/*/*/bar/*.gz.
> Currently I have a Spark Scala job that calls
> sparkContext.textFile("/foo/*/*/*/bar/*.gz")
> The upstream owners of the data feed have told me they may add additional nested directories, or remove them, for files relevant to me. In other words, files relevant to my Spark job might sit at paths like:
> * /foo/a/b/c/d/bar/*.gz
> * /foo/a/b/bar/*.gz
> They will do this to only some files, and without warning. Does anyone have ideas on how I can configure Spark to create an RDD from any text files matching the pattern /foo/**/bar/*.gz, where ** represents a variable number of intermediate directories?
> I'm working with on the order of 10^5 to 10^6 files, which has discouraged me from using anything other than the Hadoop fs API to walk the filesystem and feed that input to my Spark job, but I'm open to suggestions here as well.
>
> Thanks!
> James Ding
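For reference, a minimal, untested sketch of the Hadoop fs API walk mentioned above — it assumes an existing SparkContext sc and that /foo is visible to the default FileSystem; the "bar" parent check and ".gz" suffix are taken from the thread:

  import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
  import scala.collection.mutable.ArrayBuffer

  // Recursively list every file under /foo, keeping those whose parent
  // directory is named "bar" and whose name ends in ".gz".
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(new Path("/foo"), true)
  val matches = ArrayBuffer.empty[String]
  while (it.hasNext) {
    val p = it.next().getPath
    if (p.getName.endsWith(".gz") && p.getParent.getName == "bar") {
      matches += p.toString
    }
  }

  // textFile accepts a comma-separated list of paths, so the matches can be
  // handed to Spark without shell globbing or spark-submit arguments.
  val rdd = sc.textFile(matches.mkString(","))

Listing 10^5 to 10^6 files this way runs single-threaded on the driver, but it sidesteps both the fixed-depth glob and the need to push hundreds of thousands of arguments through spark-submit.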