[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn closed the pull request at:

https://github.com/apache/spark/pull/11440


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190994425
  
Thanks @jerryshao  @srowen @zsxwing for suggestions.I close this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/11440#discussion_r54610922
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
 ---
@@ -221,8 +221,12 @@ class JobGenerator(jobScheduler: JobScheduler) extends 
Logging {
 logInfo("Batches pending processing (" + pendingTimes.size + " 
batches): " +
   pendingTimes.mkString(", "))
 // Reschedule jobs for these times
-val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < 
restartTime }
-  .distinct.sorted(Time.ordering)
+val skipDownTime = 
conf.getBoolean("spark.streaming.skipDownTimeBatch", false)
--- End diff --

Agreed that not need to add this configuration. People can just remove the 
checkpoint instead.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/11440#discussion_r54551767
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
 ---
@@ -221,8 +221,12 @@ class JobGenerator(jobScheduler: JobScheduler) extends 
Logging {
 logInfo("Batches pending processing (" + pendingTimes.size + " 
batches): " +
   pendingTimes.mkString(", "))
 // Reschedule jobs for these times
-val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < 
restartTime }
-  .distinct.sorted(Time.ordering)
+val skipDownTime = 
conf.getBoolean("spark.streaming.skipDownTimeBatch", false)
--- End diff --

I'd prefer not to add yet another configuration to control this. It adds 
complexity. I don't think the name is descriptive here; what is a 'down time 
batch'? The current behavior is coherent, since the expected behavior is to 
pick up where it left off. It's not intended that you leave the job not running 
for a long time relative to the batch interval.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190613342
  
My bad. I will try to figure out the way to fix the when window operations 
appear with the config set to true.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190610110
  
But how do you define "much longer", based on the batch number or time? 
IMHO we cannot fix a patch based on the assumptions. We should add some 
defensive codes to make sure the logic is still consistent.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-03-01 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190608101
  
@jerryshao Thanks for the explanation. I see what you mean. It's only 
happen in the beginning, and if the stop time is much longer than the window 
time, i think it's acceptable to skip those down time batch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-02-29 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190570144
  
For example, if your sliding duration is 1, window duration is 4, and batch 
duration is 1, and the down time is 3. If you skip this this 3 batches, IIUC 
the result will be wrong, 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-02-29 Thread jeanlyn
Github user jeanlyn commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190568465
  
Thanks @jerryshao for suggestion!
> Jobs generated in the down time can be used for WAL replay, did you test 
when these down jobs are removed, the behavior of WAL replay is still correct?

It seems that the `pendingTimes` is use for WAL replay, i do not skip these 
batches 

> Also for some windowing operations, I think this removal of down time 
jobs may possibly lead to the inconsistent result of windowing aggregation.

Does inconsistent result mean wrong result?

Also, i will running the unit test with the config set to true by default 
in my local computer.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-02-29 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190531231
  
Also for some windowing operations, I think this removal of down time jobs 
may possibly lead to the inconsistent result of windowing aggregation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...

2016-02-29 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11440#issuecomment-190530543
  
Jobs generated in the down time can be used for WAL replay, did you test 
when these down jobs are removed, the behavior of WAL replay is still correct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org