All,

I have a streaming application that monitors a HDFS folder and compute some
metrics based on this data, the data in this folder will be updated by
another uploaded application.

The streaming application's batch interval is 1 minute, batch processing
time of streaming is about 30 seconds, its skeleton is like this:

streamingContext.checkpoint(...)
val fileDStream = streamingContext.textFileStream(hdfs_folder_path)
fileDStream.map(...).reduceByKey().updateStateByKey(...).foreachRDD(rdd => {
 /// send the rdd to outside
})

The streaming application and the uploaded application were stopped at last
Friday and restarted at Monday. We found that the streaming application
takes about one half day to catch up (totally about 3600 batches), since we
do not have any new files at this period, nor do have any window
operations, so these loops do nothing valued.

*Is there any way to skip these batches or to speed up the catch up
processing?*

Thanks!
Terry

Reply via email to