[GitHub] spark pull request: [SPARK-13586][STREAMING]add config to skip gen...
Github user jeanlyn closed the pull request at: https://github.com/apache/spark/pull/11440

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190994425

Thanks @jerryshao @srowen @zsxwing for the suggestions. I am closing this PR.
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/11440#discussion_r54610922

Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala

@@ -221,8 +221,12 @@ class JobGenerator(jobScheduler: JobScheduler) extends Logging {
     logInfo("Batches pending processing (" + pendingTimes.size + " batches): " +
       pendingTimes.mkString(", "))
     // Reschedule jobs for these times
-    val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < restartTime }
-      .distinct.sorted(Time.ordering)
+    val skipDownTime = conf.getBoolean("spark.streaming.skipDownTimeBatch", false)

End diff

Agreed that there is no need to add this configuration. People can just remove the checkpoint instead.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/11440#discussion_r54551767

Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala

@@ -221,8 +221,12 @@ class JobGenerator(jobScheduler: JobScheduler) extends Logging {
     logInfo("Batches pending processing (" + pendingTimes.size + " batches): " +
       pendingTimes.mkString(", "))
     // Reschedule jobs for these times
-    val timesToReschedule = (pendingTimes ++ downTimes).filter { _ < restartTime }
-      .distinct.sorted(Time.ordering)
+    val skipDownTime = conf.getBoolean("spark.streaming.skipDownTimeBatch", false)

End diff

I'd prefer not to add yet another configuration to control this. It adds complexity. I don't think the name is descriptive here; what is a 'down time batch'? The current behavior is coherent, since the expected behavior is to pick up where it left off. It's not intended that you leave the job not running for a long time relative to the batch interval.
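For readers following along: the diff above only shows the added config line, not the rest of the patch. The sketch below is a self-contained, hypothetical reconstruction of the reschedule logic under discussion (batch times modelled as plain `Long` timestamps instead of Spark's `Time`, and the assumption that the patch dropped `downTimes` when the flag was set):

```scala
// Hypothetical sketch of JobGenerator's restart rescheduling, outside Spark.
// pendingTimes: batches checkpointed but not yet processed before the stop.
// downTimes:    batch times that elapsed while the application was down.
object RescheduleSketch {
  def timesToReschedule(
      pendingTimes: Seq[Long],
      downTimes: Seq[Long],
      restartTime: Long,
      skipDownTime: Boolean): Seq[Long] = {
    // Current Spark behavior corresponds to skipDownTime = false:
    // replay every batch that fell in the down time.
    val candidates = if (skipDownTime) pendingTimes else pendingTimes ++ downTimes
    candidates.filter(_ < restartTime).distinct.sorted
  }

  def main(args: Array[String]): Unit = {
    val pending = Seq(1000L, 2000L)
    val down = Seq(3000L, 4000L, 5000L)
    println(timesToReschedule(pending, down, restartTime = 6000L, skipDownTime = false))
    // List(1000, 2000, 3000, 4000, 5000)
    println(timesToReschedule(pending, down, restartTime = 6000L, skipDownTime = true))
    // List(1000, 2000)
  }
}
```

With the flag off, the restarted driver regenerates jobs for every missed batch time; with the flag on, those batch times would simply never be generated, which is the behavior the reviewers push back on.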
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190613342

My bad. I will try to figure out a way to fix the case where window operations appear with the config set to true.
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190610110

But how do you define "much longer", based on the batch number or time? IMHO we cannot fix a patch based on assumptions. We should add some defensive code to make sure the logic is still consistent.
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190608101

@jerryshao Thanks for the explanation. I see what you mean. It only happens at the beginning, and if the stop time is much longer than the window time, I think it's acceptable to skip those down-time batches.
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190570144

For example, if your sliding duration is 1, window duration is 4, batch duration is 1, and the down time is 3, then if you skip these 3 batches, IIUC the result will be wrong.
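The concern above can be made concrete with a small self-contained sketch (the per-batch counts are invented for illustration; a window ending at time t covers the 4 batches (t-3 .. t], matching the durations in the example):

```scala
// Illustrative sketch of how skipping down-time batches changes a windowed
// aggregation. Batches are integer times; each carries a record count.
object WindowSkipSketch {
  // Window duration 4, slide 1, batch duration 1: the window ending at t
  // aggregates the batches t-3, t-2, t-1, t.
  def windowedSum(countsPerBatch: Map[Int, Int], t: Int): Int =
    (t - 3 to t).map(countsPerBatch.getOrElse(_, 0)).sum

  def main(args: Array[String]): Unit = {
    // Suppose every batch carries 10 records; batches 5..7 fall in the down time.
    val perBatch = (1 to 8).map(_ -> 10).toMap

    // Replaying the down-time batches: the window ending at t=8 sees 5,6,7,8.
    println(windowedSum(perBatch, 8)) // 40

    // Skipping them: those batch times are never generated, so their data is
    // simply absent from the window ending at t=8.
    println(windowedSum(perBatch -- Seq(5, 6, 7), 8)) // 10
  }
}
```

The windows that straddle the restart silently under-count, which is the "inconsistent result of windowing aggregation" mentioned earlier in the thread.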
Github user jeanlyn commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190568465

Thanks @jerryshao for the suggestions!

> Jobs generated in the down time can be used for WAL replay, did you test when these down jobs are removed, the behavior of WAL replay is still correct?

It seems that `pendingTimes` is used for WAL replay; I do not skip those batches.

> Also for some windowing operations, I think this removal of down time jobs may possibly lead to the inconsistent result of windowing aggregation.

Does inconsistent result mean wrong result? Also, I will run the unit tests with the config set to true by default on my local computer.
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190531231

Also, for some windowing operations, I think this removal of down-time jobs may lead to inconsistent results of windowing aggregation.
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11440#issuecomment-190530543

Jobs generated in the down time can be used for WAL replay. Did you test that, when these down-time jobs are removed, the behavior of WAL replay is still correct?