[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266655#comment-16266655 ] Apache Spark commented on SPARK-18905: -- User 'victor-wong' has created a pull request for this issue: https://github.com/apache/spark/pull/19824

> Potential Issue of Semantics of BatchCompleted
> --
>
> Key: SPARK-18905
> URL: https://issues.apache.org/jira/browse/SPARK-18905
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 1.5.1, 1.5.2, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2
> Reporter: Nan Zhu
> Assignee: Nan Zhu
> Fix For: 2.1.1, 2.2.0
>
> The current implementation of Spark Streaming considers a batch completed
> regardless of the outcome of its jobs
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203).
>
> Consider the following case: a micro batch contains two jobs, each reading
> from a different Kafka topic. One job fails due to a problem in the
> user-defined logic, after the other has finished successfully.
>
> 1. The main thread in the Spark Streaming application executes the line
> mentioned above,
> 2. and another thread (the checkpoint writer) writes a checkpoint file
> immediately after that line is executed.
> 3. Then, due to the current error-handling mechanism in Spark Streaming, the
> StreamingContext is closed
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214).
>
> When the user recovers from the checkpoint file, the data being processed by
> the failed job would never be reprocessed, because the JobSet containing the
> failed job was removed (treated as completed) before the checkpoint was
> constructed.
>
> I might have missed something in the checkpoint thread or in
> handleJobCompletion(), or this is a potential bug.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
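The race described above can be sketched in plain Scala. This is a minimal illustration, not Spark's actual code: the names (`JobSet`, `handleJobCompletion`, `pendingTimes`) mirror the real scheduler, but the types and logic are simplified to show only the problematic semantics — a batch's JobSet is removed once all its jobs have *finished*, whether or not any failed, so a checkpoint taken at that moment records the batch as done.

```scala
import scala.collection.mutable

// A job carries its batch time and its outcome (failure or success).
case class Job(time: Long, result: Either[Throwable, Unit])

class JobSet(val time: Long, jobs: Seq[Job]) {
  private var pendingJobs = jobs.size
  // The outcome of the job is ignored: success and failure both count as done.
  def handleJobCompletion(job: Job): Unit = { pendingJobs -= 1 }
  def hasCompleted: Boolean = pendingJobs == 0
}

class Scheduler {
  val jobSets: mutable.Map[Long, JobSet] = mutable.Map()

  def handleJobCompletion(jobSet: JobSet, job: Job): Unit = {
    jobSet.handleJobCompletion(job)
    // The JobSet is removed as soon as all jobs have finished, failed or not.
    if (jobSet.hasCompleted) jobSets.remove(jobSet.time)
  }

  // What a checkpoint taken at this moment would record as unfinished batches.
  def pendingTimes: Seq[Long] = jobSets.keys.toSeq
}
```

With one failed and one successful job in a batch, `pendingTimes` becomes empty as soon as both finish, so a checkpoint written right after sees no pending batch to replay.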
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816646#comment-15816646 ] Apache Spark commented on SPARK-18905: -- User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/16542
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815815#comment-15815815 ] Shixiong Zhu commented on SPARK-18905: -- Sure. Please go ahead.
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813632#comment-15813632 ] Nan Zhu commented on SPARK-18905: - [~zsxwing] If you agree with the conclusion above, I will file a PR.
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813560#comment-15813560 ] Nan Zhu commented on SPARK-18905: - I have to eat my words... when we have queued-up batches, we do need pendingTime, and yes, the original description in the JIRA still holds.
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813459#comment-15813459 ] Nan Zhu commented on SPARK-18905: - yeah, but the downTime includes all batches from "checkpoint time" to "restart time"; the jobs that have been generated but not completed would be the first batch in downTime... no?
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813455#comment-15813455 ] Shixiong Zhu commented on SPARK-18905: -- [~CodingCat] I think `pendingTime` covers the jobs that have been generated but not completed. I think you are right in the JIRA description. The failed job should not be removed, so that it can be included in `getPendingTimes`. Could you submit a PR to fix it?
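The fix suggested in this comment can be sketched as follows. This is an illustrative Scala sketch, not the actual patch: `JobSets`, `submit`, and `onAllJobsFinished` are hypothetical names standing in for the scheduler's bookkeeping. The idea is that a batch stays in the pending map when any of its jobs failed, so `getPendingTimes` (and therefore the checkpoint) still reports it as unfinished.

```scala
import scala.collection.mutable

// A job knows its batch time and whether it failed.
case class Job(batchTimeMs: Long, failed: Boolean)

class JobSets {
  private val pending = mutable.LinkedHashMap[Long, Seq[Job]]()

  def submit(timeMs: Long, jobs: Seq[Job]): Unit = pending(timeMs) = jobs

  def onAllJobsFinished(timeMs: Long): Unit = {
    // Remove the batch only if every job in it succeeded; a batch with a
    // failed job remains pending and will be re-run after recovery.
    if (pending.get(timeMs).exists(_.forall(!_.failed))) pending.remove(timeMs)
  }

  // What the checkpoint would persist as unfinished batch times.
  def getPendingTimes: Seq[Long] = pending.keys.toSeq
}
```

Under this scheme a batch containing a failed job survives into the checkpoint's pending times, while a fully successful batch is removed as before.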
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813434#comment-15813434 ] Nan Zhu commented on SPARK-18905: - Hi [~zsxwing], thanks for the reply. After testing in our environment a few more times, I feel this is not a problem anymore. The failed job would be recovered as part of the downTime (https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L216). The question now is why we need pendingTime + downTime in the above method.
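The `pendingTime + downTime` question discussed in this comment can be made concrete with a sketch of the restart logic. This is a hedged approximation of what JobGenerator.restart does, with illustrative names and a simplified notion of time: the batches to reschedule are the checkpointed pending batches plus every batch time that falls in the window between the checkpoint and the restart.

```scala
// A batch time in milliseconds (simplified stand-in for Spark's Time class).
case class Time(ms: Long)

def timesToReschedule(pendingTimes: Seq[Time],
                      checkpointTime: Time,
                      restartTime: Time,
                      batchDurationMs: Long): Seq[Time] = {
  // "Down time" batches: every batch scheduled strictly after the checkpoint,
  // up to and including the restart time.
  val downTimes = Iterator
    .iterate(checkpointTime.ms + batchDurationMs)(_ + batchDurationMs)
    .takeWhile(_ <= restartTime.ms)
    .map(Time(_))
    .toSeq
  // pendingTimes covers batches generated before the checkpoint but not
  // completed; downTimes covers batches never generated at all. Both are
  // needed, which is why the real code adds them together.
  (pendingTimes ++ downTimes).distinct.sortBy(_.ms)
}
```

With a 1-second batch interval, a checkpoint at t=2000 with batch 1000 still pending, and a restart at t=5000, the rescheduled set is {1000, 3000, 4000, 5000}: the pending batch plus the three batches missed during the outage.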
[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted
[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813405#comment-15813405 ] Shixiong Zhu commented on SPARK-18905: -- Sorry for the late reply. Yeah, good catch. However, even if it doesn't complete the job, the failed job won't be rerun after recovery, right? I think the root cause is that a failed job doesn't stop the StreamingContext. Fixing that might be a huge behavior change. cc [~tdas]