Philip, I'm guessing the crux of the problem statement is the "large
collection of" part? If so, this may be helpful at the HDFS level:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.

Otherwise you can always start with an RDD[fileUri] and go from there to an
RDD[(fileUri, read_contents)].
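
For example, something along these lines (just a sketch: it assumes the
files live on HDFS, each one is small enough to hold in memory, and the
paths are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import scala.io.Source

val sc = new SparkContext("local[*]", "files-as-records")

// Placeholder URIs; in practice list them from a directory or a manifest.
val fileUris = Seq("hdfs://namenode/data/a.txt", "hdfs://namenode/data/b.txt")

// RDD[fileUri] -> RDD[(fileUri, read_contents)]
val files = sc.parallelize(fileUris).map { uri =>
  val path = new Path(uri)
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  val contents =
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  (uri, contents)
}

// The unit of work is now a whole file, e.g. count characters per file.
files.map { case (uri, text) => (uri, text.length) }.collect().foreach(println)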

Sent while mobile. Pls excuse typos etc.
On Jan 30, 2014 9:13 AM, "尹绪森" <[email protected]> wrote:

> I am also interested in this. My current workaround is to turn each file into
> a single line of text, i.e. delete all '\n' characters and then prepend the
> filename to the line, separated by a space.
>
> [filename] [space] [content]
>
> Does anyone have better ideas?
> On 2014-1-31 at 12:18 AM, "Philip Ogren" <[email protected]> wrote:
>
>> In my Spark programming thus far, my unit of work has been a single row
>> from an HDFS file, read by creating an RDD[Array[String]] with something like:
>>
>> spark.textFile(path).map(_.split("\t"))
>>
>> Now, I'd like to do some work over a large collection of files in which
>> the unit of work is a single file (rather than a row from a file). Does
>> Spark anticipate users creating an RDD[URI] or RDD[File] or some such and
>> supporting actions and transformations that one might want to do on such an
>> RDD?  Any advice and/or code snippets would be appreciated!
>>
>> Thanks,
>> Philip
>>
>
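
A rough sketch of the one-line-per-file workaround described above (plain
Scala run as a local preprocessing step, outside Spark; it assumes small
UTF-8 text files in a local directory, and the paths are placeholders):

import java.io.{File, PrintWriter}
import scala.io.Source

// Write each file as "<filename> <contents with newlines collapsed>".
val out = new PrintWriter("/tmp/flattened.txt")
for (f <- new File("/data/input").listFiles if f.isFile) {
  val oneLine = Source.fromFile(f, "UTF-8").mkString.replace("\n", " ")
  out.println(f.getName + " " + oneLine)
}
out.close()

// Spark can then recover (filename, content) by splitting on the first space:
// spark.textFile("/tmp/flattened.txt").map { line =>
//   val i = line.indexOf(' ')
//   (line.take(i), line.drop(i + 1))
// }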
