[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r974499166

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala: ##

@@ -252,28 +267,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
     """.stripMargin
   }
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}

Review Comment:
I prefer exception-catching as it handles this issue with zero overhead. Adding a null-check here essentially falls back to the logic for a nullable schema:
https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L119-L133

From the benchmark results, we can see that there is nontrivial overhead for the null-check; for the simple case of a projection of a primitive, the overhead is almost 50%:
https://github.com/apache/spark/blob/2a1f9767213c321bd52e7714fa3b5bfc4973ba40/sql/catalyst/benchmarks/UnsafeProjectionBenchmark-jdk17-results.txt#L9-L10

You call out the situation of a null silently being replaced with a default value; this is a good point. I'm not sure how we can handle that without additional overhead of an explicit check. It seems that the default value replacement logic is coming from [Scala's own unboxing logic](https://github.com/scala/scala/blob/986dcc160aab85298f6cab0bf8dd0345497cdc01/src/library/scala/runtime/BoxesRunTime.java#L102).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
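The two behaviours discussed in the comment above can be illustrated with a minimal Java sketch. This is not Spark's generated code: `UnboxingDemo`, `scalaStyleUnboxToInt`, `javaUnbox`, and `projectWithNpeHandling` are hypothetical names chosen for illustration, and the error text is abbreviated from the message in this PR. `scalaStyleUnboxToInt` mirrors what the linked `BoxesRunTime.unboxToInt` does (a null becomes 0, so no exception is raised), while plain Java auto-unboxing of a null throws a `NullPointerException` that a try-catch wrapper can translate.

```java
final class UnboxingDemo {
    // Mirrors the behaviour of Scala's BoxesRunTime.unboxToInt:
    // a null reference is silently mapped to the default value 0.
    public static int scalaStyleUnboxToInt(Object boxed) {
        return boxed == null ? 0 : ((Integer) boxed).intValue();
    }

    // Plain Java auto-unboxing: throws NullPointerException when boxed is null.
    public static int javaUnbox(Integer boxed) {
        return boxed;
    }

    // Try-catch wrapping: no explicit null check on the non-null path;
    // an NPE, if one is thrown, is translated into a descriptive error.
    public static int projectWithNpeHandling(Integer boxed, String location) {
        try {
            return javaUnbox(boxed);
        } catch (NullPointerException npe) {
            throw new RuntimeException(
                "The value at " + location + " cannot be null, but a NULL was found.");
        }
    }

    public static void main(String[] args) {
        // The null is replaced by a default, so there is nothing to catch.
        System.out.println(scalaStyleUnboxToInt(null));
        try {
            projectWithNpeHandling(null, "outer.inner");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The sketch shows why the silent-default case is hard to cover with exception-catching alone: when the unboxing path itself tolerates null, no `NullPointerException` is ever raised, and only an explicit check would notice.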
[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r960070035

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala: ##

@@ -252,28 +266,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
     """.stripMargin
   }
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}
+       |} catch (NullPointerException npe) {
+       |  throw QueryExecutionErrors.valueCannotBeNullError(

Review Comment:
Printing the single datum won't be helpful since it's always NULL, and it would be challenging to access the whole input row from this location. We create the projection recursively, so at this point while recursing, we don't even have a reference to the fully created projection to grab the other fields. Note also that this is a failure to project the data, and we would also need to project the data to print it, so we'd have to selectively skip this field. Open to suggestions, but I don't see a clear path forward.
[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r955234541

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala: ##

@@ -252,28 +264,43 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
     """.stripMargin
   }
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+    s"""
+       |try {
+       |  ${inputExpr.trim}
+       |} catch (NullPointerException npe) {
+       |  throw QueryExecutionErrors.valueCannotBeNullError("${descPath.mkString(".")}");

Review Comment:
Good catch! I can't believe the ridiculous stuff Spark will accept as a valid column name. Fixed and added a test for this.
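The exact fix is not shown in this thread. As an illustration of the kind of problem a column name can cause when it is interpolated into a generated Java string literal, here is a minimal sketch that assumes the issue was unescaped quotes or backslashes in the name. `NpeWrapperSketch`, `escapeJavaLiteral`, and this Java version of `wrapWithNpeHandling` are hypothetical stand-ins, not the PR's Scala code, and the `inputExpr` passed in `main` is made up.

```java
import java.util.Arrays;
import java.util.List;

final class NpeWrapperSketch {
    // Escape characters that would terminate or corrupt a Java string literal.
    // (Assumed cause of the bug; the thread does not state it.)
    public static String escapeJavaLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build generated Java source that wraps inputExpr in a try-catch,
    // embedding the dotted field path as an escaped string literal.
    public static String wrapWithNpeHandling(String inputExpr, List<String> descPath) {
        String location = escapeJavaLiteral(String.join(".", descPath));
        return "try {\n"
            + "  " + inputExpr.trim() + "\n"
            + "} catch (NullPointerException npe) {\n"
            + "  throw QueryExecutionErrors.valueCannotBeNullError(\"" + location + "\");\n"
            + "}";
    }

    public static void main(String[] args) {
        // A column name containing a double quote would otherwise end the literal early.
        System.out.println(
            wrapWithNpeHandling("writer.write(0, value);", Arrays.asList("outer", "we\"ird")));
    }
}
```

Without the escaping step, a name such as `we"ird` would close the string literal in the generated source and the generated class would fail to compile.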
[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value
xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r955182512

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala: ##

@@ -447,6 +447,13 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
     new RuntimeException(fieldCannotBeNullMsg(index, fieldName))
   }
+  def valueCannotBeNullError(locationDesc: String): RuntimeException = {
+    new RuntimeException(s"The value at $locationDesc cannot be null, but a NULL was found. " +
+      "This is typically caused by the presence of a NULL value when the schema indicates the " +
+      "value should be non-null. Check that the input data matches the schema and/or that UDFs " +

Review Comment:
Yeah, it handles this situation. Marking a UDF as non-nullable just adjusts the schema, then the output row will contain a null value -- so the situation is identical to what is already tested in `GeneratedProjectionSuite`. But I can add this to `DataFrameSuite` to explicitly demonstrate that it is covered.