[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-646497299 Thanks all for reviewing! I'll review #24173 as the next step for schema validation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-644020172 ``` Sorry my comment was edited so you may be missed the content, but it is also a sort of pointing out for "pinpointing" - do you think your approach works with other state store providers as well? ``` @HeartSaVioR I totally agree with your approach is clean enough for all the state store provider, that's also what I want to achieve. But as you may find that the new regression bug SPARK-31990 is found by key validation, that's why we can't directly skip it. Anyway, super thanks for the proposal, let's try to find another way together which matches all the requirements, without changing the provider API. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-643916110 A new regression bug SPARK-31990 was found when investigating the test failure https://github.com/apache/spark/pull/28707#issuecomment-639861273. The root cause is that [this line](https://github.com/apache/spark/pull/28062/files#diff-7a46f10c3cedbf013cf255564d9483cdL2458) in SPARK-31292 made the order of groupCols in Deduplicate changed, and the order changing will break the validation logic here. That is to say, if we don't have this PR, the executor JVM could probably crash, throw a random exception, or even return a wrong answer when using the checkpoint written by the previous version. So we have 2 related work of this PR: - [ ] Fix and merge the compatibility issue in #28830 - [ ] Add new test(or modify the current Kafka test) in #28725 -- ### More detailed analysis: The expected order of `Deduplicate.groupCols` in UT KafkaMicroBatchV2SourceSuite is ``` [timestamp, partition, timestampType, key, offset, topic, value] ``` After the changes in SPARK-31292, the groupCols changed to ``` [key, value, topic, partition, offset, timestamp, timestampType] ``` Why this incompatibility bug didn't fail the `KafkaMicroBatchV2SourceSuite` when it merged? Because the UT `default config of includeHeader doesn't break the existing query from Spark 2.4` didn't test the scenario of duplicating and check the answer. Although the UT uses the checkpoint written by version 2.4.3 and streaming duplicate operation, it just wants to prove that the new header(added in SPARK-23539) doesn't break the original checkpoint file. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-639811979 All the comments addressed in 1f71563. Thanks for the review! It also includes my alternative of adding the invalidation for all state store operations in StateStoreProvider, PTAL. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-638985570 @HeartSaVioR After taking a further look. Instead of dealing with the iterator, how about adding the invalidation for all state store operations in `StateStoreProvider`? Since we can get the key/value row during load map. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-638679010 @skambha You can check the integrated tests in #28725. If we delete the validation, we'll get a NPE for [this test](https://github.com/apache/spark/pull/28725/files#diff-492f0d70824a58ef2ea94a54dc6f9707R79), and get an assertion in the unsafe row for [this test](https://github.com/apache/spark/pull/28725/files#diff-492f0d70824a58ef2ea94a54dc6f9707R185). That is to say, we will get random failures during reusing the checkpoint written by the old Spark version. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-638179692 @rednaxelafx Great thanks for the detailed comment and guidance. I'm addressing these comments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-638072075 > @skambha it doesn't fix the issue, it gives a better error message when we hit the issue. Yep, I WIP for the integrated test of the state store format invalidation. I will show you the difference with/ this patch on the error message. > Safety guards must be placed in front of this - like SPARK-27237, which I think it covers various general issues with providing clearer guide of schema incompatibility between state and the query being run. Yes, the two approach addresses different sides of this issue, SPARK-27237 require an extra file to keep the schema, which can make the schema checking possible. This one is a guard for random failure or correctness bug. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #28707: [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
xuanyuanking commented on pull request #28707: URL: https://github.com/apache/spark/pull/28707#issuecomment-637733113 cc @zsxwing @rednaxelafx @cloud-fan @HeartSaVioR This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org