Hello Spark gurus,

We have a Spark Streaming application that consumes from a Flume stream
and performs some window operations. The input batch interval is 1 minute,
and the intermediate window operations have window sizes of 1 minute, 1
hour, and 6 hours. I enabled checkpointing and the write-ahead log (WAL)
so that we can recover from failures. I also added explicit checkpoint
duration directives for each of the intermediate window streams and tried
a 2-minute duration.
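
To make the setup concrete, here is a minimal sketch of what it looks
like. The host, port, and checkpoint path are placeholders, and the bare
window() calls stand in for our actual computations:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf()
  .setAppName("FlumeWindowApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // WAL on

val ssc = new StreamingContext(conf, Minutes(1))  // 1-minute batches
ssc.checkpoint("hdfs:///checkpoints/flume-window-app")  // placeholder path

// With the WAL enabled, a replicated storage level is unnecessary.
val input = FlumeUtils.createStream(ssc, "flume-host", 4141,
  StorageLevel.MEMORY_AND_DISK_SER)

// Intermediate windowed streams (1 min, 1 hour, 6 hours), each with an
// explicit 2-minute checkpoint interval (a multiple of the slide duration).
val oneMin  = input.window(Minutes(1))
val oneHour = input.window(Minutes(60), Minutes(1))
val sixHour = input.window(Minutes(360), Minutes(1))
Seq(oneMin, oneHour, sixHour).foreach(_.checkpoint(Minutes(2)))

// ... actual computations and output operations go here ...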
However, I am running into a couple of issues at the moment:

1. The recovery time is long. It's close to 15 minutes after running the
application for only a couple of hours and then restarting it to test
recovery. I tried running for a shorter time, i.e., about half an hour,
and restarting (I had started with a fresh checkpoint folder), and the
recovery process still took about 15 minutes. We restart via the standard
checkpoint-recovery pattern; see the sketch after this list.

2. After recovery completes, the input stream does seem to recover just
fine and continue from where it left off at the crash. However, as the
computations continue, I get incorrect results for every batch after
recovery, e.g., some double values become NaN. This does not happen if I
just let the application run. It seems to me that some of the intermediate
streams were not recovered properly, which caused some of our computations
to produce incorrect values.
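
For reference, the restart path mentioned in point 1 is roughly the
standard Spark Streaming getOrCreate pattern below (the checkpoint path is
again a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/flume-window-app"  // placeholder

// Called only when no checkpoint exists; on restart, the entire DStream
// graph (including the window operations) is reloaded from the checkpoint.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FlumeWindowApp")
  val ssc = new StreamingContext(conf, Minutes(1))
  ssc.checkpoint(checkpointDir)
  // ... define the Flume input and windowed streams here, as in the
  // setup sketch above ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()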

Any help/insights into how to tackle this would be appreciated.

Thanks
NB.



