What is the precise use case and reasoning behind wanting to work on a File as the "record" in an RDD?
CombineFileInputFormat may be useful in some way:

http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java

—
Sent from Mailbox for iPhone

On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen <[email protected]> wrote:

> Philip, I guess the key problem statement is the "large collection of"
> part? If so, this may be helpful at the HDFS level:
> http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
> Otherwise you can always start with an RDD[fileUri] and go from there to an
> RDD[(fileUri, read_contents)].
>
> Sent while mobile. Pls excuse typos etc.
>
> On Jan 30, 2014 9:13 AM, "尹绪森" <[email protected]> wrote:
>
>> I am also interested in this. My current solution is to turn each file
>> into a single line of string, i.e. delete all '\n' characters, then
>> prepend the filename followed by a space:
>>
>> [filename] [space] [content]
>>
>> Does anyone have better ideas?
>>
>> On 2014-1-31 at 12:18 AM, "Philip Ogren" <[email protected]> wrote:
>>
>>> In my Spark programming thus far, my unit of work has been a single row
>>> from an HDFS file, obtained by creating an RDD[Array[String]] with
>>> something like:
>>>
>>> spark.textFile(path).map(_.split("\t"))
>>>
>>> Now I'd like to do some work over a large collection of files in which
>>> the unit of work is a single file (rather than a row from a file). Does
>>> Spark anticipate users creating an RDD[URI] or RDD[File] or some such,
>>> and does it support the actions and transformations one might want to
>>> perform on such an RDD? Any advice and/or code snippets would be
>>> appreciated!
>>>
>>> Thanks,
>>> Philip
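A minimal Scala sketch of the RDD[fileUri] -> RDD[(fileUri, contents)] approach
described above might look like the following. The input directory, the use of
the Hadoop FileSystem API inside the map, and the assumption that each file is
small enough to hold in memory as a single string are illustrative assumptions,
not something prescribed in the thread.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // Hypothetical SparkContext and input directory; substitute your own.
    val sc = new SparkContext("local[*]", "files-as-records")
    val inputDir = "hdfs:///data/small-files"

    // List the files on the driver, then parallelize the URIs into an RDD.
    val driverFs = FileSystem.get(new java.net.URI(inputDir), new Configuration())
    val fileUris: Seq[String] = driverFs
      .listStatus(new Path(inputDir))
      .filter(_.isFile)
      .map(_.getPath.toString)
      .toSeq

    val uriRdd = sc.parallelize(fileUris)

    // Map each URI to (uri, contents), reading the file on the executor.
    // The Configuration and FileSystem are created inside the closure so
    // nothing non-serializable is captured from the driver.
    val fileContents = uriRdd.map { uri =>
      val filePath = new Path(uri)
      val fs = FileSystem.get(filePath.toUri, new Configuration())
      val in = fs.open(filePath)
      try {
        val len = fs.getFileStatus(filePath).getLen.toInt
        val bytes = new Array[Byte](len)
        in.readFully(bytes)
        (uri, new String(bytes, "UTF-8"))
      } finally {
        in.close()
      }
    }

Note that with a very large number of tiny files, the HDFS small-files problem
discussed in the Cloudera post above still applies; CombineFileInputFormat (as
in the MultiFileWordCount example linked earlier) addresses that by packing many
small files into fewer input splits.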
