[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-19 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-646497299


   Thanks all for reviewing!
   I'll review #24173 as the next step for schema validation.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-15 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-644020172


   ```
   Sorry my comment was edited so you may be missed the content, but it is also 
a sort of pointing out for "pinpointing" - do you think your approach works 
with other state store providers as well?
   ```
   @HeartSaVioR I totally agree with your approach is clean enough for all the 
state store provider, that's also what I want to achieve. But as you may find 
that the new regression bug SPARK-31990 is found by key validation, that's why 
we can't directly skip it. Anyway, super thanks for the proposal, let's try to 
find another way together which matches all the requirements, without changing 
the provider API.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-14 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110


   A new regression bug SPARK-31990 was found when investigating the test 
failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The 
root cause is that [this 
line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458)
 in SPARK-31292 made the order of groupCols in Deduplicate changed, and the 
order changing will break the validation logic here. That is to say, if we 
don't have this PR, the executor JVM could probably crash, throw a random 
exception, or even return a wrong answer when using the checkpoint written by 
the previous version.
   
   So we have 2 related work of this PR:
   
   - [ ] Fix and merge the compatibility issue in #28830
   - [ ] Add new test(or modify the current Kafka test) in #28725
   
   --
   ### More detailed analysis:
   The expected order of `Deduplicate.groupCols` in UT 
KafkaMicroBatchV2SourceSuite is
   ```
   [timestamp, partition, timestampType, key, offset, topic, value]
   ```
   After the changes in SPARK-31292, the groupCols changed to
   ```
   [key, value, topic, partition, offset, timestamp, timestampType]
   ```
   
    Why this incompatibility bug didn't fail the 
`KafkaMicroBatchV2SourceSuite` when it merged?
   
   Because the UT `default config of includeHeader doesn't break the existing 
query from Spark 2.4` didn't test the scenario of duplicating and check the 
answer.
   Although the UT uses the checkpoint written by version 2.4.3 and streaming 
duplicate operation, it just wants to prove that the new header(added in 
SPARK-23539) doesn't break the original checkpoint file. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-05 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-639811979


   All the comments addressed in 1f71563. Thanks for the review!
   It also includes my alternative of adding the invalidation for all state 
store operations in StateStoreProvider, PTAL.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-04 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-638985570


   @HeartSaVioR After taking a further look. Instead of dealing with the 
iterator, how about adding the invalidation for all state store operations in 
`StateStoreProvider`? Since we can get the key/value row during load map. WDYT?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-04 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-638679010


   @skambha You can check the integrated tests in #28725. If we delete the 
validation, we'll get a NPE for [this 
test](https://github.com/apache/spark/pull/28725/files#diff-492f0d70824a58ef2ea94a54dc6f9707R79),
 and get an assertion in the unsafe row for [this 
test](https://github.com/apache/spark/pull/28725/files#diff-492f0d70824a58ef2ea94a54dc6f9707R185).
 That is to say, we will get random failures during reusing the checkpoint 
written by the old Spark version.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-03 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-638179692


   @rednaxelafx Great thanks for the detailed comment and guidance. I'm 
addressing these comments.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-03 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-638072075


   > @skambha it doesn't fix the issue, it gives a better error message when we 
hit the issue.
   
   Yep, I WIP for the integrated test of the state store format invalidation. I 
will show you the difference with/ this patch on the error message.
   
   > Safety guards must be placed in front of this - like SPARK-27237, which I 
think it covers various general issues with providing clearer guide of schema 
incompatibility between state and the query being run.
   
   Yes, the two approach addresses different sides of this issue, SPARK-27237 
require an extra file to keep the schema, which can make the schema checking 
possible. This one is a guard for random failure or correctness bug.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store

2020-06-02 Thread GitBox


xuanyuanking commented on pull request #28707:
URL: https://github.com/apache/spark/pull/28707#issuecomment-637733113


   cc @zsxwing @rednaxelafx @cloud-fan @HeartSaVioR 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org