All, I have a streaming application that monitors a HDFS folder and compute some metrics based on this data, the data in this folder will be updated by another uploaded application.
The streaming application's batch interval is 1 minute, batch processing time of streaming is about 30 seconds, its skeleton is like this: streamingContext.checkpoint(...) val fileDStream = streamingContext.textFileStream(hdfs_folder_path) fileDStream.map(...).reduceByKey().updateStateByKey(...).foreachRDD(rdd => { /// send the rdd to outside }) The streaming application and the uploaded application were stopped at last Friday and restarted at Monday. We found that the streaming application takes about one half day to catch up (totally about 3600 batches), since we do not have any new files at this period, nor do have any window operations, so these loops do nothing valued. *Is there any way to skip these batches or to speed up the catch up processing?* Thanks! Terry