Thank you for the links!  These look very useful.

I do not have a precise use case - at this point I'm just exploring what is possible and feasible. As the blog post suggests, I might have a bunch of images lying around and want to collect metadata from them. In my case, I do a lot of NLP, so I would like to extract and process text from a large collection of documents, perhaps after running them through Tika. Both of these use cases seem closely related from a Spark user's perspective.
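Concretely, something like the following untested sketch is roughly what I have in mind for the Tika route (the paths are placeholders, and I'm assuming the files are readable from the workers; Tika's parseToString pulls plain text out of each document):

import org.apache.spark.SparkContext
import org.apache.tika.Tika

val sc = new SparkContext("local[2]", "tika-extract")

// Placeholder list of document paths; in practice this could come from
// a directory listing or a manifest file.
val docPaths = Seq("/data/docs/report1.pdf", "/data/docs/report2.html")

// Extract plain text on the workers; one Tika instance per partition
// so nothing non-serializable is shipped around.
val docText = sc.parallelize(docPaths).mapPartitions { paths =>
  val tika = new Tika()
  paths.map(p => (p, tika.parseToString(new java.io.File(p))))
}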



On 1/30/2014 11:02 AM, Nick Pentreath wrote:
What is the precise use case and reasoning behind wanting to work on a File as the "record" in an RDD?

CombineFileInputFormat may be useful in some way: http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/

https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java
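Untested, but in Spark the idea would look roughly like this (assuming a Hadoop version that ships CombineTextInputFormat; the path is a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "combine-small-files")

// CombineTextInputFormat packs many small files into a few splits, so
// the RDD does not end up with one tiny partition per file.
val lines = sc.newAPIHadoopFile(
  "hdfs:///data/small-files",
  classOf[CombineTextInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map { case (_, text) => text.toString } // copy out of the reused Text object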


—
Sent from Mailbox <https://www.dropbox.com/mailbox> for iPhone


On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen <[email protected]> wrote:

    Philip, I guess the key problem statement is the "large collection
    of" part? If so this may be helpful, at the HDFS level:
    http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.

    Otherwise you can always start with an RDD[fileUri] and go from
    there to an RDD[(fileUri, read_contents)].
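
    An untested sketch of that second route (the URIs are placeholders;
    each task reads its file through the Hadoop FileSystem API):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.SparkContext
    import scala.io.Source

    val sc = new SparkContext("local[2]", "files-as-records")

    // Build this list however suits you (directory listing, manifest, ...).
    val fileUris = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt")

    val filesAsRecords = sc.parallelize(fileUris).map { uri =>
      val path = new Path(uri)
      val fs = FileSystem.get(path.toUri, new Configuration())
      val in = fs.open(path)
      try (uri, Source.fromInputStream(in, "UTF-8").mkString)
      finally in.close()
    }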

    Sent while mobile. Pls excuse typos etc.

    On Jan 30, 2014 9:13 AM, "尹绪森" <[email protected]> wrote:

        I am also interested in this. My current solution is to turn each
        file into a single line of text, i.e. delete every '\n' and then
        prepend the filename, separated by a space:

        [filename] [space] [content]
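
        Reading that back into (filename, content) pairs is then just a
        split on the first space (untested; assumes filenames contain no
        spaces, and spark is the SparkContext as in the snippet below):

        val pairs = spark.textFile(path).map { line =>
          val i = line.indexOf(' ')
          (line.substring(0, i), line.substring(i + 1))
        }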

        Anyone have better ideas?

        On 2014-1-31 at 12:18 AM, "Philip Ogren" <[email protected]> wrote:

            In my Spark programming thus far, my unit of work has been
            a single row from an HDFS file, obtained by creating an
            RDD[Array[String]] with something like:

            spark.textFile(path).map(_.split("\t"))

            Now, I'd like to do some work over a large collection of
            files in which the unit of work is a single file (rather
            than a row from a file). Does Spark anticipate users
            creating an RDD[URI] or RDD[File], or some such, and does
            it support the actions and transformations one might want
            to perform on such an RDD? Any advice and/or code snippets
            would be appreciated!

            Thanks,
            Philip


