In my Spark programming thus far, my unit of work has been a single row
from an HDFS file, obtained by creating an RDD[Array[String]] with something like:
spark.textFile(path).map(_.split("\t"))
Now I'd like to do some work over a large collection of files in which
the unit of work is a single file (rather than a row from a file). Does
Spark anticipate users creating an RDD[URI] or RDD[File] or some such,
and does it support the actions and transformations one might want to
perform on such an RDD? Any advice and/or code snippets would be
appreciated!
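
For concreteness, here is a rough sketch of the kind of thing I'm
imagining (the name perFileLineCounts is a placeholder of mine, the
paths are assumed to be readable from every worker, and counting lines
is just a stand-in for whatever real per-file work I'd do):

import org.apache.spark.SparkContext
import scala.io.Source

// One RDD element per file path, so each task handles a whole file.
def perFileLineCounts(sc: SparkContext, paths: Seq[String]) =
  sc.parallelize(paths).map { p =>
    val src = Source.fromFile(p)  // assumes the path is readable on the worker
    try (p, src.getLines().size) finally src.close()
  }

Is something along those lines reasonable, or is there a more idiomatic
way to treat whole files as the unit of work?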
Thanks,
Philip