Hello Spark gurus,

We have a Spark Streaming application that consumes from a Flume stream and performs some window operations. The batch interval is 1 minute, and the intermediate window operations use window sizes of 1 minute, 1 hour, and 6 hours. I enabled checkpointing and the write-ahead log (WAL) so that we can recover from failures. I also added explicit checkpoint duration settings for each of the intermediate windowed streams, trying a 2-minute duration. However, I am running into a couple of issues at the moment:
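For reference, here is a minimal sketch of the kind of setup described above. The checkpoint directory, Flume host/port, and the simple count() outputs are placeholders, not our actual job; the point is the 1-minute batch interval, the WAL flag, the windowed streams with explicit checkpoint intervals, and recovering the context via StreamingContext.getOrCreate:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeWindowedApp {
  // Placeholder checkpoint location for illustration only
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("FlumeWindowedApp")
      // Enable the receiver write-ahead log (needed alongside checkpointing
      // to recover received-but-unprocessed data for receiver-based sources like Flume)
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    // 1-minute batch interval
    val ssc = new StreamingContext(conf, Minutes(1))
    ssc.checkpoint(checkpointDir)

    // Placeholder Flume host/port
    val events = FlumeUtils.createStream(ssc, "flume-host", 41414,
      StorageLevel.MEMORY_AND_DISK_SER_2)

    // Intermediate windowed streams (1 min, 1 hour, 6 hours),
    // each with an explicit 2-minute checkpoint interval
    val oneMin  = events.window(Minutes(1))
    val oneHour = events.window(Minutes(60), Minutes(1))
    val sixHour = events.window(Minutes(360), Minutes(1))
    oneMin.checkpoint(Minutes(2))
    oneHour.checkpoint(Minutes(2))
    sixHour.checkpoint(Minutes(2))

    // Stand-in outputs so the windowed streams are materialized
    oneHour.count().print()
    sixHour.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from the checkpoint if one exists, otherwise create it fresh
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}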
1. The recovery time is long: close to 15 minutes after running the application for only a couple of hours and then restarting it to test recovery. I also tried running for a shorter time (about half an hour, starting from a fresh checkpoint folder) before restarting, and recovery still took 15 minutes.

2. After recovery completes, the input stream does seem to recover fine and continue where it left off at the crash. However, as the computations continue, I get incorrect results for every batch after recovery, e.g. some double values become NaNs. This does not happen if I just let the application keep running. It seems to me that some of the intermediate streams were not recovered properly, which caused some of our computations to produce incorrect values.

Any help/insights into how to go about tackling this will be appreciated.

Thanks,
NB