Hi,

On Sat, Feb 21, 2015 at 1:05 AM, craigv <craigvanderbo...@gmail.com> wrote:
> > /Might it be possible to perform "large batches" processing on HDFS time
> > series data using Spark Streaming?/
> >
> > 1. I understand that there is not currently an InputDStream that could do
> > what's needed.  I would have to create such a thing.
> > 2. Time is a problem.  I would have to use the timestamps on our events for
> > any time-based logic and state management
> > 3. The "batch duration" would become meaningless in this scenario.  Could I
> > just set it to something really small (say 1 second) and then let it "fall
> > behind", processing the data as quickly as it could?
>

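So, if it is not a problem for you that everything is processed in a single
batch, you can use streamingContext.textFileStream(hdfsDirectory). This
creates a DStream with one huge RDD containing all the data in the first
batch, followed by empty batches. A small batch duration should not be a
problem. (Note that textFileStream only picks up files that appear in the
directory after the stream has started; for files that are already there,
fileStream with newFilesOnly = false may be needed instead.)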
An alternative would be to write some code that creates one RDD per file in
your HDFS directory, puts those RDDs into a Queue, and passes that Queue to
streamingContext.queueStream(), possibly with the oneAtATime=true parameter
(so that only one RDD is processed per batch).
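
For what it's worth, a minimal PySpark sketch of the queueStream approach
could look like the following. The function names, the app name and the list
of file paths are made up for illustration; how you enumerate the files in
your HDFS directory is up to you. Treat it as a starting point under those
assumptions, not as a tested recipe for your setup.

```python
def build_file_queue(sc, paths):
    """Create one RDD per input file, preserving the order of `paths`.

    `sc` is expected to be a SparkContext; `paths` is an assumed,
    caller-supplied list of file paths (e.g. sorted by event time).
    """
    return [sc.textFile(path) for path in paths]


def run_backfill(paths, batch_seconds=1):
    """Feed one file per batch through a DStream (hypothetical helper)."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="hdfs-backfill-sketch")  # assumed app name
    ssc = StreamingContext(sc, batch_seconds)

    rdd_queue = build_file_queue(sc, paths)
    # oneAtATime=True: each batch consumes exactly one RDD from the queue.
    stream = ssc.queueStream(rdd_queue, oneAtATime=True)

    # Placeholder computation: print the number of lines per file/batch.
    stream.count().pprint()

    ssc.start()
    ssc.awaitTermination()  # blocks until the context is stopped
```

Calling run_backfill with a list of paths would then process the files in
that order, one per batch interval.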

However, doing window computations etc. with the timestamps embedded *in*
your data will be a major effort: you cannot use the existing windowing
functionality of Spark Streaming, because it works on batch time rather
than on the timestamps of your events. If you want to read more about
that, the topic has been discussed a number of times on this list; you
may want to look those threads up.
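
To give an idea of what "doing it yourself" means, here is a small
plain-Python sketch of the core step: assigning each event to a fixed-size
(tumbling) window based on its own timestamp and counting per window. The
function names and the (timestamp, payload) tuple layout are assumptions for
illustration. In a Spark job the same keying function could be used in a
map() followed by reduceByKey(), but windows that span several batches would
still need state handling on top of this.

```python
from collections import defaultdict


def window_start(event_ts, window_seconds):
    """Return the start of the tumbling window that contains event_ts.

    Timestamps are assumed to be integer seconds (e.g. Unix epoch).
    """
    return event_ts - (event_ts % window_seconds)


def count_per_window(events, window_seconds):
    """Count events per event-time window.

    `events` is an iterable of (timestamp, payload) tuples; the payload
    is ignored here, only the embedded timestamp decides the window.
    """
    counts = defaultdict(int)
    for event_ts, _payload in events:
        counts[window_start(event_ts, window_seconds)] += 1
    return dict(counts)
```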

Tobias
