Hi Folks,
I am exploring Spark for streaming from two sources, (a) Kinesis and (b) HDFS, for some of our use cases. Since we maintain state gathered over the last x hours in Spark Streaming, we would like to replay the data from the last x hours as batches during deployment. I have gone through the Spark APIs but could not find anything that starts a stream from an older timestamp. I would appreciate your input on the following:

1. Restarting with checkpointing runs the batches faster for the missed period, but when we upgrade with new code, the same checkpoint directory cannot be reused.

2. With Kinesis as the source, we can change the last checked sequence number in DynamoDB to get the data from the last x hours, but this arrives as one large bunch of data in the first restarted batch, so the data is not processed as multiple natural batches inside Spark.

3. With HDFS as the source, I could not find any way to start the streaming from old timestamped data, unless I manually (or with a script) rename the old files after starting the stream. (This workaround leads to further complications as well.)

4. May I know how zero data loss is achieved with HDFS as the source? That is, if the driver fails while processing a micro-batch, what happens when the application is restarted? Is the same micro-batch reprocessed?

Regards,
Ashok
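P.S. For reference, the checkpoint-based restart path from point 1 looks roughly like the sketch below. This is only an illustration of the standard `StreamingContext.getOrCreate` pattern; the application name, directory paths, and batch interval are placeholders, not our actual setup.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReplayDemo {
  // Builds a fresh context; only invoked when no usable checkpoint exists.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("replay-demo")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint("hdfs:///checkpoints/replay-demo") // placeholder path
    val lines = ssc.textFileStream("hdfs:///input/replay-demo") // placeholder path
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate recovers the context from the checkpoint and
    // re-runs the missed batches. If the application jar has changed, the
    // checkpointed DStream graph can fail to deserialize, which is why the
    // same checkpoint directory cannot be reused after a code upgrade.
    val ssc = StreamingContext.getOrCreate(
      "hdfs:///checkpoints/replay-demo", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```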