What is the precise use case and reasoning behind wanting to work on a File as the "record" in an RDD?
CombineFileInputFormat may be useful in some way:

http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java

—
Sent from Mailbox for iPhone

On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen <[email protected]> wrote:

> Philip, I guess the key problem statement is the "large collection of"
> part? If so, this may be helpful at the HDFS level:
> http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
> Otherwise you can always start with an RDD[fileUri] and go from there to an
> RDD[(fileUri, read_contents)].
>
> Sent while mobile. Pls excuse typos etc.
>
> On Jan 30, 2014 9:13 AM, "尹绪森" <[email protected]> wrote:
>
>> I am also interested in this. My current solution is to turn each file
>> into a single line of string, i.e. delete all '\n' characters, then
>> prepend the filename followed by a space:
>>
>> [filename] [space] [content]
>>
>> Does anyone have better ideas?
>>
>> On 2014-1-31 at 12:18 AM, "Philip Ogren" <[email protected]> wrote:
>>
>>> In my Spark programming thus far, my unit of work has been a single row
>>> from an HDFS file, obtained by creating an RDD[Array[String]] with
>>> something like:
>>>
>>> spark.textFile(path).map(_.split("\t"))
>>>
>>> Now I'd like to do some work over a large collection of files in which
>>> the unit of work is a single file (rather than a row from a file). Does
>>> Spark anticipate users creating an RDD[URI] or RDD[File] or some such,
>>> and does it support the actions and transformations one might want to
>>> perform on such an RDD? Any advice and/or code snippets would be
>>> appreciated!
>>>
>>> Thanks,
>>> Philip
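A minimal Scala sketch of the RDD[fileUri] -> RDD[(fileUri, contents)] approach
described above might look like the following. The input directory, the use of
the Hadoop FileSystem API inside the map, and the assumption that each file is
small enough to hold in memory as a single string are illustrative assumptions,
not something prescribed in the thread.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext

    // Hypothetical SparkContext and input directory; substitute your own.
    val sc = new SparkContext("local[*]", "files-as-records")
    val inputDir = "hdfs:///data/small-files"

    // List the files on the driver, then parallelize the URIs into an RDD.
    val driverFs = FileSystem.get(new java.net.URI(inputDir), new Configuration())
    val fileUris: Seq[String] = driverFs
      .listStatus(new Path(inputDir))
      .filter(_.isFile)
      .map(_.getPath.toString)
      .toSeq

    val uriRdd = sc.parallelize(fileUris)

    // Map each URI to (uri, contents), reading the file on the executor.
    // The Configuration and FileSystem are created inside the closure so
    // nothing non-serializable is captured from the driver.
    val fileContents = uriRdd.map { uri =>
      val filePath = new Path(uri)
      val fs = FileSystem.get(filePath.toUri, new Configuration())
      val in = fs.open(filePath)
      try {
        val len = fs.getFileStatus(filePath).getLen.toInt
        val bytes = new Array[Byte](len)
        in.readFully(bytes)
        (uri, new String(bytes, "UTF-8"))
      } finally {
        in.close()
      }
    }

Note that with a very large number of tiny files, the HDFS small-files problem
discussed in the Cloudera post above still applies; CombineFileInputFormat (as
in the MultiFileWordCount example linked earlier) addresses that by packing many
small files into fewer input splits.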
