Thank you for the links! These look very useful.
I do not have a precise use case yet - at this point I'm just exploring
what is possible and feasible. As the blog post suggests, I might have a
bunch of images lying around and want to collect metadata from them. In
my case, I do a lot of NLP, so I would like to process text from a
large collection of documents, perhaps after running them through Tika.
Both of these use cases seem closely related from a Spark user's perspective.
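For concreteness, here is roughly what I imagine, just as a sketch and not a
tested recipe: start from an RDD of file paths and pull text out of each file
with Tika's parseToString. The paths are made up, I'm assuming a spark-shell
where sc is available, and this only works as written if every worker can read
the paths locally.

  import java.io.File
  import org.apache.tika.Tika

  // Hypothetical sketch: local paths reachable from every worker.
  val paths = sc.parallelize(Seq("/data/docs/report.pdf", "/data/docs/notes.docx"))
  val extracted = paths.map { p =>
    val tika = new Tika()                  // created inside the task, so nothing needs to be serialized
    (p, tika.parseToString(new File(p)))   // (path, extracted plain text)
  }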
On 1/30/2014 11:02 AM, Nick Pentreath wrote:
What is the precise use case and reasoning behind wanting to work on a
File as the "record" in an RDD?
CombineFileInputFormat may be useful in some way:
http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/MultiFileWordCount.java
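If useful, here is a rough sketch of plugging a combining input format into
Spark. It assumes a Hadoop version that ships
org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat and a spark-shell
where sc is available; note that the records are still individual lines, so
this reduces per-small-file overhead rather than making a whole file the record:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

  // Pack many small files into fewer splits (the max split size here is arbitrary).
  sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)
  val lines = sc.newAPIHadoopFile(
    "hdfs:///data/small-files",
    classOf[CombineTextInputFormat],
    classOf[LongWritable],
    classOf[Text]
  ).map(_._2.toString)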
On Thu, Jan 30, 2014 at 7:34 PM, Christopher Nguyen <[email protected]> wrote:
Philip, I guess the key problem statement is the "large collection
of" part? If so this may be helpful, at the HDFS level:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
Otherwise you can always start with an RDD[fileUri] and go from
there to an RDD[(fileUri, read_contents)].
Sent while mobile. Pls excuse typos etc.
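A minimal sketch of the RDD[fileUri] route, assuming the files live on HDFS
and every worker can reach them; readFully is just a placeholder helper, not
an existing API:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.commons.io.IOUtils

  // Placeholder helper: slurp one HDFS file into a String.
  def readFully(uri: String): String = {
    val fs = FileSystem.get(new java.net.URI(uri), new Configuration())
    val in = fs.open(new Path(uri))
    try IOUtils.toString(in, "UTF-8") finally in.close()
  }

  val fileUris = sc.parallelize(Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt"))
  val contents = fileUris.map(uri => (uri, readFully(uri)))  // RDD[(fileUri, read_contents)]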
On Jan 30, 2014 9:13 AM, "尹绪森" <[email protected]> wrote:
I am also interested in this. My current solution is to turn each file
into a single line of text, i.e. delete all '\n' characters and then
prepend the filename to the line, separated by a space:
[filename] [space] [content]
Anyone have better ideas?
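For reference, a small sketch of reading the flattened file back as
(filename, content) pairs; the path is just an example:

  // Each line is "[filename] [space] [content]", with newlines already removed.
  val docs = sc.textFile("hdfs:///data/flattened.txt").map { line =>
    val sep = line.indexOf(' ')
    (line.substring(0, sep), line.substring(sep + 1))  // (filename, content)
  }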
On 2014-1-31 at 12:18 AM, "Philip Ogren" <[email protected]> wrote:
In my Spark programming thus far, my unit of work has been
a single row from an HDFS file, obtained by creating an
RDD[Array[String]] with something like:
spark.textFile(path).map(_.split("\t"))
Now, I'd like to do some work over a large collection of
files in which the unit of work is a single file (rather
than a row from a file). Does Spark anticipate users
creating an RDD[URI] or RDD[File] or some such and
supporting actions and transformations that one might want
to do on such an RDD? Any advice and/or code snippets
would be appreciated!
Thanks,
Philip