[GitHub] [spark] HyukjinKwon commented on a change in pull request #25407: [SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter

2019-08-11 Thread GitBox
HyukjinKwon commented on a change in pull request #25407: 
[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
URL: https://github.com/apache/spark/pull/25407#discussion_r312774692
 
 

 ##
 File path: docs/structured-streaming-programming-guide.md
 ##
 @@ -2251,13 +2251,13 @@ When the streaming query is started, Spark calls the 
function or the object’s
 
 - The close() method (if it exists) is called if an open() method exists and 
returns successfully (irrespective of the return value), except if the JVM or 
Python process crashes in the middle.
 
-- **Note:** The partitionId and epochId in the open() method can be used to 
deduplicate generated data 
-  when failures cause reprocessing of some input data. This depends on the 
execution mode of the query. 
-  If the streaming query is being executed in the micro-batch mode, then every 
partition represented 
-  by a unique tuple (partition_id, epoch_id) is guaranteed to have the same 
data. 
-  Hence, (partition_id, epoch_id) can be used to deduplicate and/or 
transactionally commit 
-  data and achieve exactly-once guarantees. However, if the streaming query is 
being executed 
-  in the continuous mode, then this guarantee does not hold and therefore 
should not be used for deduplication.
+- **Note:** Spark doesn't guarantee same output for (partitionId, epochId) on 
failure, so deduplication
+  cannot be achieved with (partitionId, epochId). e.g. source provides 
different number of
+  partitions for some reason, Spark optimization changes number of partitions, 
etc.
+  Refer SPARK-28650 for more details. `epochId` can still be used for 
deduplication, but there's less
 
 Review comment:
  Just to match the other docs:
   
   ```
   See [SPARK-28650](https://issues.apache.org/jira/browse/SPARK-28650) for 
more details.
   ```
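  
  As an aside for readers of the quoted doc text above, here is a minimal sketch (not from the PR) of the `ForeachWriter` lifecycle it describes. The class name and the println sink are illustrative only:
  
  ```scala
  import org.apache.spark.sql.ForeachWriter
  
  // Illustrative only: shows when open()/process()/close() are invoked.
  class LoggingWriter extends ForeachWriter[String] {
    override def open(partitionId: Long, epochId: Long): Boolean = {
      // Called once per partition per epoch. Returning false skips process()
      // for this (partitionId, epochId), but close() is still called.
      true
    }
  
    override def process(value: String): Unit =
      println(s"processing: $value")
  
    override def close(errorOrNull: Throwable): Unit = {
      // Called whenever open() returned without throwing, regardless of its
      // return value, unless the JVM or Python process crashed mid-epoch.
    }
  }
  ```
  
  It would be wired up with something like `df.writeStream.foreach(new LoggingWriter).start()`.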





[GitHub] [spark] HyukjinKwon commented on a change in pull request #25407: [SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter

2019-08-11 Thread GitBox
HyukjinKwon commented on a change in pull request #25407: 
[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
URL: https://github.com/apache/spark/pull/25407#discussion_r312774510
 
 

 ##
 File path: docs/structured-streaming-programming-guide.md
 ##
 @@ -2251,13 +2251,13 @@ When the streaming query is started, Spark calls the 
function or the object’s
 
 - The close() method (if it exists) is called if an open() method exists and 
returns successfully (irrespective of the return value), except if the JVM or 
Python process crashes in the middle.
 
-- **Note:** The partitionId and epochId in the open() method can be used to 
deduplicate generated data 
-  when failures cause reprocessing of some input data. This depends on the 
execution mode of the query. 
-  If the streaming query is being executed in the micro-batch mode, then every 
partition represented 
-  by a unique tuple (partition_id, epoch_id) is guaranteed to have the same 
data. 
-  Hence, (partition_id, epoch_id) can be used to deduplicate and/or 
transactionally commit 
-  data and achieve exactly-once guarantees. However, if the streaming query is 
being executed 
-  in the continuous mode, then this guarantee does not hold and therefore 
should not be used for deduplication.
+- **Note:** Spark doesn't guarantee same output for (partitionId, epochId) on 
failure, so deduplication
+  cannot be achieved with (partitionId, epochId). e.g. source provides 
different number of
+  partitions for some reason, Spark optimization changes number of partitions, 
etc.
+  Refer SPARK-28650 for more details. `epochId` can still be used for 
deduplication, but there's less
+  benefit to leverage this, as the chance for Spark to successfully write all 
partitions and fail to checkpoint
 
 Review comment:
  Using the epoch alone seems not very useful given the description. Should we maybe just remove it?
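  
  To make the trade-off concrete, here is a hedged sketch of the record-level alternative: since the partition layout of an epoch can change across retries, deduplication can key on (epochId, record key) rather than (partitionId, epochId). `KvStore` is a hypothetical stand-in for an idempotent (upsert) sink, not a real API, and its in-memory map is illustration only (it would not survive a restart or span executors):
  
  ```scala
  import org.apache.spark.sql.ForeachWriter
  import scala.collection.concurrent.TrieMap
  
  // Hypothetical stand-in for an external idempotent sink; illustration only.
  object KvStore {
    val store = TrieMap.empty[(Long, String), String]
    def upsert(key: (Long, String), value: String): Unit = store.put(key, value)
  }
  
  class UpsertWriter extends ForeachWriter[(String, String)] {
    private var epochId: Long = _
  
    override def open(partitionId: Long, epochId: Long): Boolean = {
      this.epochId = epochId // partitionId deliberately ignored: not stable across retries
      true
    }
  
    override def process(record: (String, String)): Unit =
      // A retried epoch overwrites the same (epochId, key) slot instead of duplicating.
      KvStore.upsert((epochId, record._1), record._2)
  
    override def close(errorOrNull: Throwable): Unit = ()
  }
  ```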





[GitHub] [spark] HyukjinKwon commented on a change in pull request #25407: [SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter

2019-08-11 Thread GitBox
HyukjinKwon commented on a change in pull request #25407: 
[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
URL: https://github.com/apache/spark/pull/25407#discussion_r312773068
 
 

 ##
 File path: docs/structured-streaming-programming-guide.md
 ##
 @@ -2251,13 +2251,13 @@ When the streaming query is started, Spark calls the 
function or the object’s
 
 - The close() method (if it exists) is called if an open() method exists and 
returns successfully (irrespective of the return value), except if the JVM or 
Python process crashes in the middle.
 
-- **Note:** The partitionId and epochId in the open() method can be used to 
deduplicate generated data 
-  when failures cause reprocessing of some input data. This depends on the 
execution mode of the query. 
-  If the streaming query is being executed in the micro-batch mode, then every 
partition represented 
-  by a unique tuple (partition_id, epoch_id) is guaranteed to have the same 
data. 
-  Hence, (partition_id, epoch_id) can be used to deduplicate and/or 
transactionally commit 
-  data and achieve exactly-once guarantees. However, if the streaming query is 
being executed 
-  in the continuous mode, then this guarantee does not hold and therefore 
should not be used for deduplication.
+- **Note:** Spark doesn't guarantee same output for (partitionId, epochId) on 
failure, so deduplication
+  cannot be achieved with (partitionId, epochId). e.g. source provides 
different number of
+  partitions for some reason, Spark optimization changes number of partitions, 
etc.
 
 Review comment:
   typo: some reasons





[GitHub] [spark] HyukjinKwon commented on a change in pull request #25407: [SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter

2019-08-11 Thread GitBox
HyukjinKwon commented on a change in pull request #25407: 
[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
URL: https://github.com/apache/spark/pull/25407#discussion_r312772909
 
 

 ##
 File path: docs/structured-streaming-programming-guide.md
 ##
 @@ -2251,13 +2251,13 @@ When the streaming query is started, Spark calls the 
function or the object’s
 
 - The close() method (if it exists) is called if an open() method exists and 
returns successfully (irrespective of the return value), except if the JVM or 
Python process crashes in the middle.
 
-- **Note:** The partitionId and epochId in the open() method can be used to 
deduplicate generated data 
-  when failures cause reprocessing of some input data. This depends on the 
execution mode of the query. 
-  If the streaming query is being executed in the micro-batch mode, then every 
partition represented 
-  by a unique tuple (partition_id, epoch_id) is guaranteed to have the same 
data. 
-  Hence, (partition_id, epoch_id) can be used to deduplicate and/or 
transactionally commit 
-  data and achieve exactly-once guarantees. However, if the streaming query is 
being executed 
-  in the continuous mode, then this guarantee does not hold and therefore 
should not be used for deduplication.
+- **Note:** Spark doesn't guarantee same output for (partitionId, epochId) on 
failure, so deduplication
+  cannot be achieved with (partitionId, epochId). e.g. source provides 
different number of
+  partitions for some reason, Spark optimization changes number of partitions, 
etc.
 
 Review comment:
  typo: same reason?





[GitHub] [spark] HyukjinKwon commented on a change in pull request #25407: [SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter

2019-08-11 Thread GitBox
HyukjinKwon commented on a change in pull request #25407: 
[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter
URL: https://github.com/apache/spark/pull/25407#discussion_r312772609
 
 

 ##
 File path: docs/structured-streaming-programming-guide.md
 ##
 @@ -2251,13 +2251,13 @@ When the streaming query is started, Spark calls the 
function or the object’s
 
 - The close() method (if it exists) is called if an open() method exists and 
returns successfully (irrespective of the return value), except if the JVM or 
Python process crashes in the middle.
 
-- **Note:** The partitionId and epochId in the open() method can be used to 
deduplicate generated data 
-  when failures cause reprocessing of some input data. This depends on the 
execution mode of the query. 
-  If the streaming query is being executed in the micro-batch mode, then every 
partition represented 
-  by a unique tuple (partition_id, epoch_id) is guaranteed to have the same 
data. 
-  Hence, (partition_id, epoch_id) can be used to deduplicate and/or 
transactionally commit 
-  data and achieve exactly-once guarantees. However, if the streaming query is 
being executed 
-  in the continuous mode, then this guarantee does not hold and therefore 
should not be used for deduplication.
+- **Note:** Spark doesn't guarantee same output for (partitionId, epochId) on 
failure, so deduplication
 
 Review comment:
  No big deal, but I usually avoid abbreviations in the docs: `doesn't` -> `does not`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org