Hi Folks,


I am exploring Spark Streaming with two sources for some of our use cases:
(a) Kinesis and (b) HDFS. Since we maintain state gathered over the last x
hours in Spark Streaming, we would like to replay the data from the last x
hours as batches during deployment. I have gone through the Spark APIs but
could not find anything that starts a stream from an older timestamp. I
would appreciate your input on the following points.

1.      Restarting from a checkpoint replays the batches for the missed
period quickly, but when we upgrade with new code, the same checkpoint
directory cannot be reused (the serialized application state is no longer
compatible).

2.      With Kinesis as the source, we can rewind the last checkpointed
sequence number in DynamoDB to re-read the last x hours, but all of that
data then arrives as one large first batch after restart, instead of being
processed as multiple natural batches inside Spark.
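For reference, this is roughly how we set up the Kinesis stream (a sketch only; the app, stream, and endpoint names below are placeholders). As far as I understand, InitialPositionInStream is honoured only when the KCL application has no prior checkpoint in DynamoDB, which is why we resort to editing the table by hand:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val conf = new SparkConf().setAppName("kinesis-replay")   // placeholder name
val ssc  = new StreamingContext(conf, Seconds(60))

// TRIM_HORIZON / LATEST only applies to a fresh KCL application; on
// restart, the sequence numbers checkpointed in DynamoDB take precedence,
// hence the manual edit to replay older data.
val stream = KinesisUtils.createStream(
  ssc,
  "kinesis-replay-app",                      // KCL app name = DynamoDB table
  "my-stream",                               // placeholder stream name
  "kinesis.us-east-1.amazonaws.com",         // placeholder endpoint
  "us-east-1",                               // placeholder region
  InitialPositionInStream.TRIM_HORIZON,
  Seconds(60),                               // KCL checkpoint interval
  StorageLevel.MEMORY_AND_DISK_2)
```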

3.      For the HDFS source, I could not find any way to start the stream
from older data, other than manually (or with a script) renaming the old
files after starting the stream so they appear new. (This workaround leads
to further complications as well.)
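The closest built-in option I found is the newFilesOnly flag on fileStream, but if I read the docs correctly it only considers files modified within the remember window (spark.streaming.minRememberDuration), not an arbitrary last x hours, and everything it picks up still lands in the first batch. A sketch, with a placeholder directory:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("hdfs-replay")      // placeholder name
val ssc  = new StreamingContext(conf, Seconds(60))

// newFilesOnly = false makes the stream also pick up files already present
// in the directory at start-up, but only those within the remember window,
// and they are all processed together in the first batch.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///data/incoming",                              // placeholder path
    (path: Path) => !path.getName.startsWith("."),        // skip temp files
    newFilesOnly = false
  ).map(_._2.toString)
```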

4.      May I know how zero data loss is achieved with HDFS as the source?
That is, if the driver fails while processing a micro-batch, what happens
when the application is restarted? Is the same micro-batch reprocessed?



Regards

Ashok
