[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-11-14 Thread GitBox


xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1314561335

   Thanks for your comments @cloud-fan. If it were possible to handle all data 
types using the exception-catching approach, I would still advocate for it. But 
you raised a good point that it's not possible to handle primitives in this 
manner, and this is actually an even more severe issue, because it becomes a 
correctness issue (NULL silently replaced with 0) instead of just a confusing 
error message.
   
   My initial next attempt, following along your suggestion regarding 
`AssertNotNull`, was going to be along these lines:
   - Create a new unary expression `UntrustedNullGuard`. Behaviorally, it is 
identical to `AssertNotNull`, except for maybe a different error message.
   - In contrast to `AssertNotNull` which gets optimized away in cases of 
non-nullable input, `UntrustedNullGuard` gets optimized away _in cases of 
nullable input_.
   - We look for untrusted sources (input data sources, UDFs, etc.) and wrap 
them in `UntrustedNullGuard` without concern for whether the data source 
returns a null or non-null type--if it's nullable, the check will be optimized 
away, and if it's not, then we get the guard as expected.
   - We can have a flag like `spark.sql.optimizer.guardUntrustedNullability` 
which defaults to true, but can be set to false by those wishing to indicate to 
Spark that they really do trust their data sources and don't want the runtime 
checks added. (The optimizer rule won't apply in this case.)
   
   However this approach has some drawbacks. The one I found to be a fatal flaw 
is that there are situations where the operator itself does some unpacking of 
the result value and will already replace a NULL value with a default, e.g. 
here:
   
https://github.com/apache/spark/blob/3f97cd620a61bc2685bf35215185365e63bedc0a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L1176-L1180
   
   I don't think there's any way to catch this situation except to modify the 
codegen for the operators themselves. I've put up an alternate approach 
demonstrating this in PR #38660; please take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-09-19 Thread GitBox


xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1251319065

   Thanks for the suggestion @cloud-fan ! Good point about there many places 
where Spark trusts nullability. Here I am trying to target places where _user 
code_ could introduce a null. This includes data sources (including in-memory 
dataset creation like `sparkContext.parallelize()` or `DatasetHolder.toDS()`) 
and UDFs (including lambdas in calls to DF/DS APIs). User code is inherently 
untrustworthy, so more checks are warranted. I think any of these places where 
user code supplies values would be covered by this PR since they all go through 
projection before being accessed by other codepaths, but LMK if you disagree. I 
guess one situation could be if the optimizer completely removes some operator 
because of the nullability, so the data is never even accessed--are you 
thinking about situations like this?
   
   Unfortunately we cannot use `AssertNotNull` for this; the optimizer will 
remove it if the input schema indicates that the schema is non-null:
   
https://github.com/apache/spark/blob/96831bbb6749910c8f9497c048ba05e6e317649f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L826
   And that optimizer rule is there for a good reason; there is nontrivial 
overhead associated with the null check as discussed further in [my other 
comment](https://github.com/apache/spark/pull/37634#discussion_r974499166).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-09-07 Thread GitBox


xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1239693536

   Ping again to @MaxGekk @karenfeng @gengliangwang -- any thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-08-31 Thread GitBox


xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1233486676

   Pushed up new commits rebasing on latest changes.
   
   cc @cloud-fan @dongjoon-hyun in case either of you are interested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-08-23 Thread GitBox


xkrogen commented on PR #37634:
URL: https://github.com/apache/spark/pull/37634#issuecomment-1225021291

   cc @shardulm94 (thanks for the push towards constructing the path into a 
single message instead of using chained exceptions, the additional code changes 
were minimal and the resulting message is much nicer)
   and a few folks who have been working on error reporting @MaxGekk @karenfeng 
@gengliangwang


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org