Re: How to start spark streaming application with recent past timestamp for replay of old batches?

2016-02-21 Thread Akhil Das
On Mon, Feb 22, 2016 at 12:18 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:

> Hi Folks,
>
>
>
> I am exploring Spark Streaming with two sources, (a) Kinesis and (b)
> HDFS, for some of our use cases. Since we maintain state gathered over
> the last x hours in Spark Streaming, we would like to replay the data
> from the last x hours as batches during deployment. I have gone through
> the Spark APIs but could not find anything that lets a stream start from
> an older timestamp. I would appreciate your input on this.
>
> 1.  Restarting with checkpointing quickly re-runs the batches for the
> missed period, but when we upgrade to new code, the same checkpoint
> directory cannot be reused.
>
=> It is true that you won't be able to reuse the checkpoint when you
upgrade your code, but production code is not upgraded all that often. You
can keep most of the tunables (such as the streaming batch duration and
other parameters) in an external configuration file instead of hard-coding
them, so that routine changes don't break the checkpoint.
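
A minimal sketch of that pattern, assuming a hypothetical app.properties
file (the property names and paths here are made up for illustration):

    import java.io.FileInputStream
    import java.util.Properties

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulApp {
      def main(args: Array[String]): Unit = {
        // Hypothetical app.properties, e.g.:
        //   batch.duration.seconds=30
        //   checkpoint.dir=hdfs:///checkpoints/stateful-app
        val props = new Properties()
        props.load(new FileInputStream("app.properties"))
        val checkpointDir = props.getProperty("checkpoint.dir")

        def createContext(): StreamingContext = {
          val conf = new SparkConf().setAppName("stateful-app")
          val ssc = new StreamingContext(
            conf, Seconds(props.getProperty("batch.duration.seconds").toLong))
          ssc.checkpoint(checkpointDir)
          // ... wire up sources and stateful transformations here ...
          ssc
        }

        // Recovers the context (and any pending batches) from the
        // checkpoint if one exists, otherwise builds a fresh context.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that values baked into the checkpoint (like the batch duration) still
come from the checkpoint on recovery; the external file only helps for
runs that start fresh.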


> 2.  For the case with Kinesis as the source, we can change the last
> checkpointed sequence number in DynamoDB to get the data from the last x
> hours, but this arrives as one large bunch of data in the first batch
> after restart, so the data is not processed as multiple natural batches
> inside Spark.
>
=> The Kinesis GetRecords API has a Limit parameter for capping how much
data each call returns; you might want to look into that and implement a
custom receiver for your use case.

http://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html
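
A rough sketch of such a receiver, assuming the AWS Java SDK for Kinesis
(the class name, pacing, and starting point are made up for illustration;
this is not the built-in Spark Kinesis connector):

    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
    import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    import scala.collection.JavaConverters._

    // Hypothetical receiver that replays one shard from an old sequence
    // number, capping each GetRecords call so the backlog drains as a
    // steady stream instead of one huge first batch.
    class ThrottledKinesisReceiver(streamName: String,
                                   shardId: String,
                                   startSeqNum: String,
                                   maxRecordsPerCall: Int)
      extends Receiver[Array[Byte]](StorageLevel.MEMORY_AND_DISK_2) {

      @volatile private var stopped = false

      override def onStart(): Unit =
        new Thread("kinesis-replay") {
          override def run(): Unit = fetch()
        }.start()

      override def onStop(): Unit = stopped = true

      private def fetch(): Unit = {
        val client = AmazonKinesisClientBuilder.defaultClient()
        // Start reading at the old sequence number instead of LATEST.
        var iterator = client.getShardIterator(
          new GetShardIteratorRequest()
            .withStreamName(streamName)
            .withShardId(shardId)
            .withShardIteratorType("AT_SEQUENCE_NUMBER")
            .withStartingSequenceNumber(startSeqNum)).getShardIterator

        while (!stopped && iterator != null) {
          val result = client.getRecords(
            new GetRecordsRequest()
              .withShardIterator(iterator)
              .withLimit(maxRecordsPerCall)) // the rate cap
          result.getRecords.asScala.foreach { r =>
            val buf = r.getData
            val bytes = new Array[Byte](buf.remaining())
            buf.get(bytes)
            store(bytes)
          }
          iterator = result.getNextShardIterator
          Thread.sleep(1000) // stay under the shard's read throughput
        }
      }
    }

A production version would also need to handle multiple shards,
resharding, and checkpointing its own progress.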

> 3.  For the HDFS source, I could not find any way to start the stream
> from data with older timestamps, unless I manually (or with a script)
> rename the old files after starting the stream. (This workaround leads
> to further complications as well.)
>
=> What you can do is move the files to a different directory once they
are processed. Then, when you restart the stream for whatever reason, you
can make it pick up all the files in the directory instead of only the new
ones, by passing the *newFilesOnly* boolean param to fileStream.


> 4.  May I know how zero data loss is achieved with HDFS as the source?
> i.e., if the driver fails while processing a micro-batch, what happens
> when the application is restarted? Is the same micro-batch reprocessed?
>
=> Yes, if the application is restarted (recovering from its checkpoint),
the unfinished micro-batch will be reprocessed.
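
One related knob: enabling a graceful shutdown lets in-flight batches
finish before the driver exits, so a planned restart has less to
reprocess from the checkpoint. A small sketch (assuming Spark 1.4 or
later, where this property is available):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("hdfs-source-app")
      // On SIGTERM, stop the streaming context gracefully so batches
      // that were already received get processed before exit.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")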


>
>
> Regards
>
> Ashok
>

