[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-11-27 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266655#comment-16266655 ]

Apache Spark commented on SPARK-18905:
--

User 'victor-wong' has created a pull request for this issue:
https://github.com/apache/spark/pull/19824

> Potential Issue of Semantics of BatchCompleted
> --
>
> Key: SPARK-18905
> URL: https://issues.apache.org/jira/browse/SPARK-18905
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.1, 1.5.2, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2
>Reporter: Nan Zhu
>Assignee: Nan Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> The current implementation of Spark Streaming considers a batch completed 
> regardless of the results of its jobs 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L203)
> Let's consider the following case:
> A micro-batch contains 2 jobs that read from two different Kafka topics, 
> respectively. One of these jobs fails due to a problem in the user-defined 
> logic, after the other one has finished successfully. 
> 1. The main thread in the Spark Streaming application executes the line 
> mentioned above, 
> 2. and another thread (the checkpoint writer) makes a checkpoint file 
> immediately after that line is executed. 
> 3. Then, due to the current error-handling mechanism in Spark Streaming, the 
> StreamingContext is closed 
> (https://github.com/apache/spark/blob/1169db44bc1d51e68feb6ba2552520b2d660c2c0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala#L214)
> When the user recovers from the checkpoint file, because the JobSet containing 
> the failed job was removed (taken as completed) before the checkpoint was 
> constructed, wouldn't the data being processed by the failed job never be 
> reprocessed?
> I might have missed something in the checkpoint thread or in 
> handleJobCompletion(), or this is a potential bug.
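
For context, the completion path under discussion behaves roughly as sketched below. This is a simplified, self-contained illustration with assumed names, not the exact Spark source; the point is only that the JobSet is dropped once all of its jobs have finished, whether or not one of them failed.

{code:scala}
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

final case class Job(time: Long, result: Try[Unit])

final class JobSet(val time: Long, val numJobs: Int) {
  private var numFinished = 0
  def handleJobCompletion(): Unit = numFinished += 1
  // "Completed" counts failed jobs too; this is the crux of the issue.
  def hasCompleted: Boolean = numFinished == numJobs
}

final class SchedulerSketch {
  private val jobSets = mutable.Map.empty[Long, JobSet]

  def submit(jobSet: JobSet): Unit = jobSets(jobSet.time) = jobSet

  def handleJobCompletion(job: Job): Unit = {
    val jobSet = jobSets(job.time)
    jobSet.handleJobCompletion()
    if (jobSet.hasCompleted) {
      // ~JobScheduler.scala#L203: the batch is removed from the pending map
      // here even if one of its jobs failed, so a checkpoint written after
      // this point no longer records this batch time as pending.
      jobSets.remove(jobSet.time)
    }
    job.result match {
      case Failure(e) =>
        // ~JobScheduler.scala#L214: only now is the failure reported (which
        // eventually stops the StreamingContext), after the JobSet has
        // already been dropped.
        println(s"Error running job at time ${job.time}: $e")
      case Success(_) => ()
    }
  }
}
{code}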






[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-10 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15816646#comment-15816646 ]

Apache Spark commented on SPARK-18905:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/16542




[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-10 Thread Shixiong Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15815815#comment-15815815 ]

Shixiong Zhu commented on SPARK-18905:
--

Sure. Please go ahead.




[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-09 Thread Nan Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813632#comment-15813632 ]

Nan Zhu commented on SPARK-18905:
-

[~zsxwing] If you agree with the conclusion above, I will file a PR.




[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-09 Thread Nan Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813560#comment-15813560 ]

Nan Zhu commented on SPARK-18905:
-

I take back my words...

When we have queued-up batches, we do need pendingTime.

And yes, the original description in the JIRA still holds.




[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-09 Thread Nan Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813459#comment-15813459 ]

Nan Zhu commented on SPARK-18905:
-

Yeah, but downTime includes all batches from "checkpoint time" to "restart 
time".

Shouldn't the jobs that have been generated but not completed correspond to the 
first batch in downTime... no?
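
If I understand the restart path correctly, downTime is derived purely from timestamps. A minimal sketch under that assumption (hypothetical names, not the exact source):

{code:scala}
// Sketch (assumed semantics): downTime covers every batch time in the window
// from the checkpointed time up to (but excluding) the restart time, stepped
// by the batch duration. A batch generated before the checkpoint but never
// completed would instead be tracked by the checkpointed pendingTimes.
def downTimes(checkpointTimeMs: Long, restartTimeMs: Long, batchMs: Long): Seq[Long] =
  checkpointTimeMs until restartTimeMs by batchMs

// Example: checkpoint at t=1000, restart at t=4000, 1-second batches:
// downTimes(1000, 4000, 1000) == Seq(1000, 2000, 3000)
{code}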




[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-09 Thread Shixiong Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813455#comment-15813455 ]

Shixiong Zhu commented on SPARK-18905:
--

[~CodingCat] I think `pendingTime` refers to the jobs that have been generated 
but not completed.

I think you are right in the JIRA description. The failed job should not be 
removed, so that it can be included in `getPendingTimes`. Could you submit a PR 
to fix it?
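
A minimal sketch of what that fix could look like, with hypothetical names (the real change would live in JobScheduler's handling of completed JobSets):

{code:scala}
// Hypothetical sketch of the proposed fix: keep a batch in the pending set
// unless every one of its jobs succeeded, so a batch with a failed job still
// appears in getPendingTimes() and is therefore persisted in the checkpoint,
// letting recovery reschedule it.
import scala.collection.mutable
import scala.util.Try

final class PendingBatchTracker {
  private val pending = mutable.SortedSet.empty[Long]

  def batchSubmitted(time: Long): Unit = pending += time

  def batchFinished(time: Long, results: Seq[Try[Unit]]): Unit = {
    // Proposed change: drop the batch only when all of its jobs succeeded.
    if (results.forall(_.isSuccess)) pending -= time
  }

  // What the checkpoint writer would persist as pendingTimes.
  def getPendingTimes(): Seq[Long] = pending.toSeq
}
{code}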




[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-09 Thread Nan Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813434#comment-15813434 ]

Nan Zhu commented on SPARK-18905:
-

Hi, [~zsxwing]

Thanks for the reply.

After testing in our environment a few more times, I feel this is not a problem 
anymore. The failed job would be recovered 
(https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L216) 
as part of the downTime.

The question right now is: why do we need pendingTime + downTime in the above 
method?
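
If that reading is right, the reason for combining the two sets is roughly the following (a simplified sketch of the restart logic, paraphrased with assumed names rather than quoted from the source):

{code:scala}
// Sketch (assumed names): pendingTimes covers batches that were generated
// before the checkpoint but never completed; downTimes covers batches whose
// scheduled time fell inside the outage window. Restart needs the union of
// both, deduplicated and sorted, as the full set of batches to reschedule.
def timesToReschedule(
    pendingTimes: Seq[Long], // restored from the checkpoint
    checkpointTimeMs: Long,
    restartTimeMs: Long,
    batchMs: Long): Seq[Long] = {
  val downTimes = checkpointTimeMs until restartTimeMs by batchMs
  (pendingTimes ++ downTimes).filter(_ < restartTimeMs).distinct.sorted
}
{code}

Each rescheduled time then has its jobs regenerated and resubmitted, which is how a batch recorded as pending (or lost to down time) gets replayed.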




[jira] [Commented] (SPARK-18905) Potential Issue of Semantics of BatchCompleted

2017-01-09 Thread Shixiong Zhu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-18905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813405#comment-15813405 ]

Shixiong Zhu commented on SPARK-18905:
--

Sorry for the late reply. Yeah, good catch. However, even if we don't mark the 
batch as completed, the failed job won't be re-run after recovery, right? I 
think the root cause is that a failed job doesn't stop the StreamingContext, 
and changing that might be a huge behavior change.

cc [~tdas]
