To add to the discussion, Spark Streaming's text file stream automatically
detects new files and generates RDDs out of them. For example, if you run
10-second batches, then all new files (of the same format) created in the
directory during each interval will be read and turned into per-interval
RDDs. Then you can do whatever you want with those RDDs.

var unionRDD: RDD[String] = ...

streamingContext.textFileStream(<directory>).foreachRDD { rdd =>
  // do what you want with the RDD
  // if you want to keep unioning
  unionRDD = unionRDD.union(rdd)
}

However, note that continually unioning RDDs can rapidly increase the number
of partitions in the unioned RDD, which may degrade performance. Consider
calling RDD.coalesce periodically to reduce the number of partitions.
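Putting the pieces together, a minimal sketch might look like the following. This is illustrative only: it assumes an existing SparkContext `sc`, a 10-second batch interval, and an example HDFS path; the choice of coalescing every 10 batches down to `sc.defaultParallelism` partitions is arbitrary and should be tuned for your workload.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumes an existing SparkContext `sc`
val ssc = new StreamingContext(sc, Seconds(10))

// start with an empty RDD and union each batch into it
var unionRDD: RDD[String] = sc.parallelize(Seq.empty[String])
var batches = 0

ssc.textFileStream("hdfs:///path/to/dir").foreachRDD { rdd =>
  unionRDD = unionRDD.union(rdd)
  batches += 1
  // every 10 batches, collapse the accumulated partitions
  // so the union does not grow unboundedly wide
  if (batches % 10 == 0) {
    unionRDD = unionRDD.coalesce(sc.defaultParallelism)
  }
}

ssc.start()
ssc.awaitTermination()
```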

TD


On Wed, Feb 19, 2014 at 5:44 AM, Ashish Rangole <arang...@gmail.com> wrote:

> You could also look at how the Spark Streaming DStream does what you
> described.
>
> Take a look at Spark StreamingContext.textFileStream implementation.
> On Feb 18, 2014 8:02 PM, "David Thomas" <dt5434...@gmail.com> wrote:
>
>> Perfect.
>>
>>
>> On Tue, Feb 18, 2014 at 7:58 PM, Mayur Rustagi
>> <mayur.rust...@gmail.com> wrote:
>>
>>> RDD is immutable so modification of RDD is not possible, you can
>>> generate a new RDD unioning the two RDD created from new files and old
>>> in-memory RDD.
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +919632149971
>>> http://www.sigmoidanalytics.com
>>> https://twitter.com/mayur_rustagi
>>>
>>>
>>>
>>> On Tue, Feb 18, 2014 at 6:33 PM, David Thomas <dt5434...@gmail.com> wrote:
>>>
>>>> Let's say I have an RDD of text files from HDFS. During the runtime, is
>>>> it possible to check for new files in a particular directory and if
>>>> present, add them to the existing RDD?
>>>>
>>>
>>>
>>