In my Spark programming thus far, my unit of work has been a single row from an HDFS file, obtained by creating an RDD[Array[String]] with something like:

spark.textFile(path).map(_.split("\t"))

Now, I'd like to do some work over a large collection of files in which the unit of work is a single file (rather than a row from a file). Does Spark anticipate users creating an RDD[URI] or RDD[File] or some such, and does it support the actions and transformations one might want to perform on such an RDD? Any advice and/or code snippets would be appreciated!
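
For concreteness, here's a rough sketch of the kind of thing I have in mind. It just distributes a list of paths and opens each file inside a task; processFile and the paths are placeholders, and I don't know whether this is the idiomatic approach:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Placeholder for whatever per-file work I actually need to do.
def processFile(path: String, contents: String): Int =
  contents.split("\n").length

def processWholeFiles(sc: SparkContext, paths: Seq[String]) =
  sc.parallelize(paths).map { p =>
    // Open the file inside the task so each worker reads its own files.
    val fs = FileSystem.get(new Configuration())
    val in = fs.open(new Path(p))
    try processFile(p, scala.io.Source.fromInputStream(in).mkString)
    finally in.close()
  }

Is something along these lines reasonable, or is there a built-in mechanism I should be using instead?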

Thanks,
Philip
