Hi,

On Sat, Feb 21, 2015 at 1:05 AM, craigv <craigvanderbo...@gmail.com> wrote:
> Might it be possible to perform "large batches" processing on HDFS time
> series data using Spark Streaming?
>
> 1. I understand that there is not currently an InputDStream that could do
> what's needed. I would have to create such a thing.
> 2. Time is a problem. I would have to use the timestamps on our events
> for any time-based logic and state management.
> 3. The "batch duration" would become meaningless in this scenario. Could
> I just set it to something really small (say 1 second) and then let it
> "fall behind", processing the data as quickly as it could?
So, if it is not an issue for you that everything is processed in one
batch, you can use streamingContext.textFileStream(hdfsDirectory). This
will create a DStream whose first batch holds one huge RDD with all the
data, followed by empty batches afterwards. A small batch size should not
be a problem.

An alternative would be to write some code that creates one RDD per file
in your HDFS directory, build a Queue of those RDDs, and then use
streamingContext.queueStream(), possibly with the oneAtATime=true
parameter (which processes exactly one RDD from the queue per batch).

However, doing window computations etc. with the timestamps embedded *in*
your data will be a major effort, as in: you cannot use the existing
windowing functionality from Spark Streaming, which is driven by batch
(processing) time rather than event time. There have been a number of
discussions about that topic on this list; maybe you can look them up.

Tobias
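For reference, a minimal Scala sketch of the textFileStream approach. This
is only an illustration of the idea above; the app name, HDFS path, and
batch interval are placeholders, and the processing inside foreachRDD is
up to you:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LargeBatchFromHdfs {
  def main(args: Array[String]): Unit = {
    // Placeholders: app name and master depend on your deployment.
    val conf = new SparkConf().setAppName("LargeBatchFromHdfs")
    // Small batch duration; the first batch may "fall behind" while it
    // chews through the bulk of the data, as discussed above.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Per the behavior described above: the bulk of the data arrives in
    // one (possibly huge) RDD, with empty batches afterwards.
    val lines = ssc.textFileStream("hdfs:///path/to/events")
    lines.foreachRDD { rdd =>
      // Your processing here, e.g. parsing events and using their
      // embedded timestamps for time-based logic.
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```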
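And a sketch of the queueStream alternative. Again an illustration only:
the file paths are hypothetical stand-ins for however you list the files
in your HDFS directory, one RDD per file:

```scala
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamFromHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamFromHdfs")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    // Hypothetical file list: replace with a real listing of your
    // HDFS directory. One RDD is created per file.
    val files = Seq("hdfs:///events/part-0000", "hdfs:///events/part-0001")
    val rddQueue: mutable.Queue[RDD[String]] =
      mutable.Queue(files.map(f => sc.textFile(f)): _*)

    // oneAtATime = true: each batch dequeues and processes exactly
    // one RDD (i.e. one file) from the queue.
    val stream = ssc.queueStream(rddQueue, oneAtATime = true)
    stream.foreachRDD { rdd =>
      // Time-based logic would have to use the timestamps embedded
      // in the event data itself, as discussed above.
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The oneAtATime flag is what gives you back some ordering control: with it
set, one file's worth of data is handled per batch interval instead of the
whole queue draining at once.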