See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My patch
has been accepted, and this enhancement is scheduled for 1.3.0.

This lets you specify an initial RDD for the updateStateByKey operation.
Let me know if you need any more information.
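
A minimal sketch of how the new overload can be used (the stream
pairDStream, the key/value types, and the values in initialRDD are just
placeholders):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Seed state carried over from a previous run (hypothetical values).
    val initialRDD: RDD[(String, Long)] =
      ssc.sparkContext.parallelize(Seq(("a", 100L), ("b", 32L)))

    val updateFunc = (values: Seq[Long], state: Option[Long]) =>
      Some(values.sum + state.getOrElse(0L))

    // The overload added by SPARK-3660: update function, partitioner,
    // and the initial state RDD.
    val stateDStream = pairDStream.updateStateByKey[Long](
      updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)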

On Sun, Feb 22, 2015 at 5:21 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:

> Hi,
>
> On Sat, Feb 21, 2015 at 1:05 AM, craigv <craigvanderbo...@gmail.com>
> wrote:
>
>> /Might it be possible to perform "large batches" processing on HDFS time
>> series data using Spark Streaming?/
>>
>> 1. I understand that there is not currently an InputDStream that could
>> do what's needed. I would have to create such a thing.
>> 2. Time is a problem. I would have to use the timestamps on our events
>> for any time-based logic and state management.
>> 3. The "batch duration" would become meaningless in this scenario. Could
>> I just set it to something really small (say 1 second) and then let it
>> "fall behind", processing the data as quickly as it could?
>
> So, if it is not an issue for you that everything is processed in one
> batch, you can use streamingContext.textFileStream(hdfsDirectory). This
> will create a DStream that delivers one huge RDD with all the data in the
> first batch, followed by empty batches afterwards. A small batch size
> should not be a problem.
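>
> A minimal sketch of that approach (the directory path and the processing
> body are placeholders):
>
>     import org.apache.spark.SparkConf
>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>
>     val conf = new SparkConf().setAppName("LargeBatchOverHDFS")
>     val ssc = new StreamingContext(conf, Seconds(1))
>
>     // All the files show up as one large RDD in the first batch;
>     // later batches are empty unless new files arrive.
>     val lines = ssc.textFileStream("hdfs:///path/to/events")
>     lines.foreachRDD { rdd =>
>       // process the whole data set here
>     }
>
>     ssc.start()
>     ssc.awaitTermination()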
> An alternative would be to write some code that creates one RDD per file
> in your HDFS directory, builds a Queue of those RDDs, and then uses
> streamingContext.queueStream(), possibly with the oneAtATime=true
> parameter (which processes only one RDD per batch).
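>
> Roughly like this (listing the files via the Hadoop FileSystem API; the
> directory path is a placeholder):
>
>     import scala.collection.mutable
>     import org.apache.hadoop.fs.{FileSystem, Path}
>     import org.apache.spark.rdd.RDD
>
>     val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
>     val files = fs.listStatus(new Path("hdfs:///path/to/events"))
>       .map(_.getPath.toString)
>
>     // One RDD per file, queued up; oneAtATime = true means each batch
>     // interval pulls exactly one RDD off the queue.
>     val rddQueue =
>       mutable.Queue(files.map(f => ssc.sparkContext.textFile(f)): _*)
>     val oneFilePerBatch = ssc.queueStream(rddQueue, oneAtATime = true)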
>
> However, doing window computations etc. with the timestamps embedded *in*
> your data will be a major effort, as in: you cannot use the existing
> windowing functionality from Spark Streaming. If you want to read more,
> there have been a number of discussions on this topic on this list; maybe
> you can look them up.
>
> Tobias