[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1314561335 Thanks for your comments @cloud-fan. If it were possible to handle all data types using the exception-catching approach, I would still advocate for it. But you raised a good point that it's not possible to handle primitives in this manner, and this is actually an even more severe issue, because it becomes a correctness issue (NULL silently replaced with 0) instead of just a confusing error message. My initial next attempt, following along your suggestion regarding `AssertNotNull`, was going to be along these lines: - Create a new unary expression `UntrustedNullGuard`. Behaviorally, it is identical to `AssertNotNull`, except for maybe a different error message. - In contrast to `AssertNotNull` which gets optimized away in cases of non-nullable input, `UntrustedNullGuard` gets optimized away _in cases of nullable input_. - We look for untrusted sources (input data sources, UDFs, etc.) and wrap them in `UntrustedNullGuard` without concern for whether the data source returns a null or non-null type--if it's nullable, the check will be optimized away, and if it's not, then we get the guard as expected. - We can have a flag like `spark.sql.optimizer.guardUntrustedNullability` which defaults to true, but can be set to false by those wishing to indicate to Spark that they really do trust their data sources and don't want the runtime checks added. (The optimizer rule won't apply in this case.) However this approach has some drawbacks. The one I found to be a fatal flaw is that there are situations where the operator itself does some unpacking of the result value and will already replace a NULL value with a default, e.g. here: https://github.com/apache/spark/blob/3f97cd620a61bc2685bf35215185365e63bedc0a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L1176-L1180 I don't think there's any way to catch this situation except to modify the codegen for the operators themselves. I've put up an alternate approach demonstrating this in PR #38660; please take a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1251319065 Thanks for the suggestion @cloud-fan ! Good point about there many places where Spark trusts nullability. Here I am trying to target places where _user code_ could introduce a null. This includes data sources (including in-memory dataset creation like `sparkContext.parallelize()` or `DatasetHolder.toDS()`) and UDFs (including lambdas in calls to DF/DS APIs). User code is inherently untrustworthy, so more checks are warranted. I think any of these places where user code supplies values would be covered by this PR since they all go through projection before being accessed by other codepaths, but LMK if you disagree. I guess one situation could be if the optimizer completely removes some operator because of the nullability, so the data is never even accessed--are you thinking about situations like this? Unfortunately we cannot use `AssertNotNull` for this; the optimizer will remove it if the input schema indicates that the schema is non-null: https://github.com/apache/spark/blob/96831bbb6749910c8f9497c048ba05e6e317649f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L826 And that optimizer rule is there for a good reason; there is nontrivial overhead associated with the null check as discussed further in [my other comment](https://github.com/apache/spark/pull/37634#discussion_r974499166). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1239693536 Ping again to @MaxGekk @karenfeng @gengliangwang -- any thoughts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1233486676 Pushed up new commits rebasing on latest changes. cc @cloud-fan @dongjoon-hyun in case either of you are interested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on PR #37634: URL: https://github.com/apache/spark/pull/37634#issuecomment-1225021291 cc @shardulm94 (thanks for the push towards constructing the path into a single message instead of using chained exceptions, the additional code changes were minimal and the resulting message is much nicer) and a few folks who have been working on error reporting @MaxGekk @karenfeng @gengliangwang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org