Answers inline.
On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

> Hi all,
>
> I am currently using Spark Streaming to conduct a real-time data
> analytics. We receive data from Kafka. We want to generate output files
> that contain results that are based on the data we receive from a specific
> time interval.
>
> I have several questions on Spark Streaming's timestamp:
>
> 1) If I use saveAsTextFiles, it seems Spark streaming will generate files
> in complete minutes, such as 5:00:01, 5:00:01 (converted from Unix time),
> etc. Does this mean the results are based on the data from 5:00:01 to
> 5:00:02, 5:00:02 to 5:00:03, etc. Or the time stamps just mean the time the
> files are generated?
>

The file named 5:00:01 contains the results for data received between
5:00:00 and 5:00:01 (based on the system time of the cluster). In other
words, the timestamp marks the end of the batch interval, not its start.

> 2) If I do not use saveAsTextFiles, how do I get the exact time interval
> of the RDD when I use foreachRDD to do custom output of the results?
>

There is a version of foreachRDD that lets you pass a function that takes
a Time object (the batch time) in addition to the RDD.

> 3) How can we specify the starting time of the batches?
>

What do you mean? Batches are timed based on the system time of the
cluster.

>
> Thanks!
>
> Bill
>
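
To make the interval arithmetic from answers 1 and 2 concrete, here is a minimal sketch in plain Python (no Spark needed). It assumes the output name has the form prefix-TIME_IN_MS[.suffix], as saveAsTextFiles produces, and that the batch time marks the end of the interval. The helper names batch_interval and batch_time_from_name are made up for illustration and are not part of Spark.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime.fromtimestamp(0, tz=timezone.utc)

def batch_interval(batch_time_ms, batch_duration_ms):
    """Return (start, end) of the data window for one batch.

    Assumption: batch_time_ms is the end of the window, so the window
    starts one batch duration earlier.
    """
    end = EPOCH + timedelta(milliseconds=batch_time_ms)
    start = end - timedelta(milliseconds=batch_duration_ms)
    return start, end

def batch_time_from_name(name, prefix, suffix=""):
    """Extract TIME_IN_MS from a name shaped like prefix-TIME_IN_MS[.suffix]."""
    stem = name[len(prefix) + 1:]            # drop "prefix-"
    if suffix:
        stem = stem[:-(len(suffix) + 1)]     # drop ".suffix"
    return int(stem)

if __name__ == "__main__":
    # Hypothetical file name; 1-second batches are assumed here.
    ms = batch_time_from_name("results-1405555201000.txt", "results", "txt")
    start, end = batch_interval(ms, 1000)
    print("data window:", start.isoformat(), "to", end.isoformat())
```

The same subtraction applies inside foreachRDD: take the batch time passed to your function and subtract the batch duration you configured on the streaming context to get the start of the window.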