I presume you need access to the path of each file you are reading.
I don't know whether there is a good way to do that for HDFS; I need to read
the files myself, with something like this:

    import java.net.URI
    import scala.collection.mutable.ListBuffer
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext

    // Builds an RDD of (file URI, line) pairs for every file directly under inputPath.
    def openWithPath(inputPath: String, sc: SparkContext) = {
      val fs = new Path(inputPath).getFileSystem(sc.hadoopConfiguration)
      // List the files (non-recursively) and collect their URIs.
      val filesIt = fs.listFiles(new Path(inputPath), false)
      val paths = new ListBuffer[URI]
      while (filesIt.hasNext) {
        paths += filesIt.next.getPath.toUri
      }
      // Read each file as its own RDD, tag every line with the file's URI,
      // then union everything into a single RDD[(URI, String)].
      val withPaths = paths.toList.map { p =>
        sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString).map {
          case (_, s) => (p, s.toString)
        }
      }
      withPaths.reduce(_ ++ _)
    }

I would be interested if there is a better way to do the same thing.

Cheers,
a:

On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> Could you provide an example of what you mean?
>
> I know it's possible to create an RDD from a path with wildcards, like in
> the subject.
>
> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
> provide a comma-delimited list of paths.
>
> Nick
>
> On Sunday, June 1, 2014, Oleg Proudnikov <oleg.proudni...@gmail.com> wrote:
>
>> Hi All,
>>
>> Is it possible to create an RDD from a directory tree of the following
>> form?
>>
>> RDD[(PATH, Seq[TEXT])]
>>
>> Thank you,
>> Oleg
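P.S. To make Nick's wildcard and comma-delimited-list point concrete: both can
be combined in a single call. The bucket name and date patterns below are
invented for illustration:

    // One RDD spanning two months of gzipped logs; the Hadoop input layer
    // splits the comma-delimited list and expands the wildcards.
    val logs = sc.textFile("s3n://bucket/2014-05-??/*.gz,s3n://bucket/2014-06-??/*.gz")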
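And one candidate for the "better way", assuming Spark 1.0+ and files small
enough to hold in memory one at a time: sc.wholeTextFiles reads a directory
into (path, content) pairs directly, which is also close to the
RDD[(PATH, Seq[TEXT])] shape Oleg asked about. A minimal sketch (the helper
name readWholeFiles is mine, not from the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD functions such as mapValues

    // One (path, lines) pair per file under inputPath; each file is read
    // as a single record, so individual files must fit in memory.
    def readWholeFiles(inputPath: String, sc: SparkContext) =
      sc.wholeTextFiles(inputPath)        // RDD[(String, String)]
        .mapValues(_.split("\n").toSeq)   // RDD[(String, Seq[String])]

The one-record-per-file behaviour is the main trade-off versus the
newAPIHadoopFile approach above, which streams each file line by line.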