Philip, I guess the key problem statement is the "large collection of" part? If so, this may be helpful at the HDFS level: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
Otherwise you can always start with an RDD[fileUri] and go from there to an RDD[(fileUri, read_contents)]; see the sketch below the quoted thread.

Sent while mobile. Pls excuse typos etc.

On Jan 30, 2014 9:13 AM, "尹绪森" <[email protected]> wrote:

> I am also interested in this. My current solution is to turn each file into a
> single line of text, i.e. delete all '\n', then prepend the filename and a
> space:
>
> [filename] [space] [content]
>
> Does anyone have better ideas?
>
> On 2014-1-31 at 12:18 AM, "Philip Ogren" <[email protected]> wrote:
>
>> In my Spark programming thus far, my unit of work has been a single row
>> from an HDFS file, created as an RDD[Array[String]] with something like:
>>
>> spark.textFile(path).map(_.split("\t"))
>>
>> Now I'd like to do some work over a large collection of files in which
>> the unit of work is a single file (rather than a row from a file). Does
>> Spark anticipate users creating an RDD[URI] or RDD[File] or some such, and
>> supporting the actions and transformations one might want to run on such
>> an RDD? Any advice and/or code snippets would be appreciated!
>>
>> Thanks,
>> Philip
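Here is a minimal sketch of that RDD[fileUri] -> RDD[(fileUri, contents)] step, assuming the files live on HDFS and each one fits comfortably in memory. The names sc (a SparkContext) and paths (a Seq of file URIs) are illustrative, not anything from this thread:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // paths: the file URIs to process; sc: your SparkContext.
    def filesWithContents(sc: SparkContext, paths: Seq[String]) = {
      val uris = sc.parallelize(paths)                          // RDD[fileUri]
      uris.map { uri =>
        // Runs on the workers: open each file via the Hadoop FileSystem API
        // and slurp it into a single string.
        val fs = FileSystem.get(new java.net.URI(uri), new Configuration())
        val in = fs.open(new Path(uri))
        try {
          (uri, scala.io.Source.fromInputStream(in).mkString)   // (fileUri, contents)
        } finally {
          in.close()
        }
      }
    }

If individual files are large you would want to stream them rather than build one big string per file; this is only meant to show the shape of the pipeline.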

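And a small sketch of reading back the [filename] [space] [content] layout that 尹绪森 describes, splitting each line at the first space. Here combinedPath is a hypothetical path to the pre-processed file, and each line is assumed to start with the filename followed by a single space:

    // combinedPath is a placeholder for wherever the one-file-per-line data lives.
    val byFile = sc.textFile(combinedPath).map { line =>
      val i = line.indexOf(' ')
      (line.substring(0, i), line.substring(i + 1))   // (filename, content)
    }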