Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?
One workaround would be to remove or move the files from the input directory once they have been processed.

Thanks
Best Regards
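The remove-or-move workaround could look something like the following, sketched outside of Spark itself in plain Python (the directory layout and the `process_file` callback are assumptions, not anything from the thread):

```python
import shutil
from pathlib import Path

def drain_input_dir(input_dir: str, archive_dir: str, process_file) -> list:
    """Process every file currently in input_dir, then move it to
    archive_dir so a restarted job cannot pick it up again."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in sorted(Path(input_dir).iterdir()):
        if f.is_file():
            process_file(f)                             # e.g. count words, write output
            shutil.move(str(f), str(archive / f.name))  # file leaves the input directory
            moved.append(f.name)
    return moved
```

Note the ordering: because a file is moved only after processing succeeds, a crash between the two steps would still reprocess that one file, so the move narrows the duplicate window rather than eliminating it.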
RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?
Akhil,

From my test, I can see the files in the last batch will always be reprocessed upon restarting from checkpoint, even after a graceful shutdown. I think usually each file is expected to be processed only once. Maybe this is a bug in fileStream? Or do you know any approach to work around it?

Much thanks!
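If the last batch is replayed no matter how the job stops, one option (not discussed in the thread, offered here as a sketch) is to keep a small ledger of already-processed filenames inside the output action and skip replays; the ledger path and function names below are invented for illustration:

```python
import json
from pathlib import Path

def load_ledger(path: str) -> set:
    """Read the set of filenames already processed, if the ledger exists."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()

def process_once(filename: str, ledger_path: str, process) -> bool:
    """Run `process` only if `filename` has not been seen before.
    Returns True when the file was actually processed."""
    done = load_ledger(ledger_path)
    if filename in done:
        return False                      # replayed file: skip it
    process(filename)
    done.add(filename)
    Path(ledger_path).write_text(json.dumps(sorted(done)))
    return True
```

In a real job the ledger would need to live on a shared filesystem such as HDFS and be written atomically; this sketch only shows the skip-on-replay idea.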
Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?
Good question. With fileStream or textFileStream, basically it only takes in the files whose timestamp falls in the current batch window (https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172), and when checkpointing is enabled (https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L324) it restores the latest filenames from the checkpoint directory, which I believe will reprocess some files.

Thanks
Best Regards
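The behavior described above can be illustrated with a toy model. This is not Spark's actual implementation (the real logic is in the FileInputDStream links above); it only models the two ideas in play: a batch picks up files whose modification time falls in its window, and recovery regenerates every batch since the last checkpoint, reading those files again:

```python
def files_for_batch(files, batch_start, batch_interval):
    """Toy model: pick files whose mod time falls inside this batch window."""
    return sorted(name for name, mtime in files.items()
                  if batch_start <= mtime < batch_start + batch_interval)

def run_with_restart(files, batch_interval, crash_after, checkpoint_every):
    """Process batches, 'crash' after crash_after batches, then recover
    from the last checkpointed batch; return the list of processed files."""
    processed = []
    last_checkpoint = 0
    for b in range(crash_after):
        processed += files_for_batch(files, b * batch_interval, batch_interval)
        if (b + 1) % checkpoint_every == 0:
            last_checkpoint = b + 1       # everything before here is safe
    # recovery: regenerate every batch since the last checkpoint
    for b in range(last_checkpoint, crash_after):
        processed += files_for_batch(files, b * batch_interval, batch_interval)
    return processed
```

With one file per batch, a crash one batch after a checkpoint replays exactly that batch's file, which matches the observation earlier in the thread that files from the most recent batches come back after a restart.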
RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?
Akhil, thank you for the response. I want to explore more.

Suppose the application is just monitoring an HDFS folder and writing the word count of each streaming batch back to HDFS. If I kill the application *before* Spark takes a checkpoint, then after recovery Spark will resume processing from the timestamp of the latest checkpoint. That means some files will be processed twice and duplicate results will be generated.

Please correct me if this understanding is wrong. Thanks again!
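For the word-count-to-HDFS scenario, the usual way to make such a replay harmless is to make the output operation idempotent, for example by keying each batch's counts by its batch time so a replayed batch overwrites rather than appends. A minimal sketch with invented names (a dict stands in for the HDFS output path):

```python
from collections import Counter

def write_batch_counts(store: dict, batch_time: int, lines: list) -> None:
    """Overwrite this batch's word counts keyed by batch time, so
    replaying the same batch leaves the stored result unchanged."""
    store[batch_time] = Counter(word for line in lines for word in line.split())

def total_counts(store: dict):
    """Aggregate across batches; replayed batches are not double-counted."""
    total = Counter()
    for counts in store.values():
        total += counts
    return total
```

The same idea applies to real HDFS output: write each batch to a path derived from the batch time and overwrite on rewrite, instead of appending.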
Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?
I think it should be fine; that's the whole point of checkpointing (in case of driver failure, etc.).

Thanks
Best Regards
RE: If not stop StreamingContext gracefully, will checkpoint data be consistent?
Hi, can someone help to confirm the behavior? Thank you!
If not stop StreamingContext gracefully, will checkpoint data be consistent?
This is a quick question about checkpointing. If the StreamingContext is not stopped gracefully, will the checkpoint be consistent? Or should I always shut down the application gracefully in order to use the checkpoint?

Thank you very much!

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org