[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-09-19 Thread GitBox


xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r974499166


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##
@@ -252,28 +267,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
  """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+s"""
+   |try {
+   |  ${inputExpr.trim}

Review Comment:
   I prefer exception-catching as it handles this issue with zero overhead. 
Adding a null-check here essentially falls back to the logic for a nullable 
schema:
   
https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L119-L133
   From the benchmark results, we can see that there is nontrivial overhead for 
the null-check; for the simple case of a projection of a primitive, the 
overhead is almost 50%:
   
https://github.com/apache/spark/blob/2a1f9767213c321bd52e7714fa3b5bfc4973ba40/sql/catalyst/benchmarks/UnsafeProjectionBenchmark-jdk17-results.txt#L9-L10
   
   You call out the situation of a null silently being replaced with a default 
value; this is a good point. I'm not sure how we can handle that without the 
additional overhead of an explicit check. It seems that the default-value 
replacement logic comes from [Scala's own unboxing 
logic](https://github.com/scala/scala/blob/986dcc160aab85298f6cab0bf8dd0345497cdc01/src/library/scala/runtime/BoxesRunTime.java#L102).
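
To make the trade-off above concrete, here is a small illustrative sketch in plain Java (not code from the PR). `scalaStyleUnbox` is a local stand-in written to mirror the null-to-default behaviour of the unboxing helpers in the linked `BoxesRunTime` (for example `unboxToInt`); `javaStyleUnbox`, `checkedUnbox`, and the class name are hypothetical and exist only for this example.

```java
final class UnboxingDemo {
    // Local stand-in (assumption) mirroring scala.runtime.BoxesRunTime.unboxToInt:
    // a null reference is silently turned into the default value 0.
    static int scalaStyleUnbox(Object boxed) {
        return boxed == null ? 0 : ((Integer) boxed).intValue();
    }

    // Plain Java auto-unboxing: a null Integer raises NullPointerException,
    // which is the case a surrounding try-catch can translate into a better error.
    static int javaStyleUnbox(Integer boxed) {
        return boxed;
    }

    // The explicit alternative: a branch executed for every value,
    // which is the per-value cost the comment above wants to avoid.
    static int checkedUnbox(Object boxed, String location) {
        if (boxed == null) {
            throw new RuntimeException("The value at " + location + " cannot be null");
        }
        return ((Integer) boxed).intValue();
    }

    public static void main(String[] args) {
        System.out.println("scala-style unbox of null: " + scalaStyleUnbox(null));
        try {
            javaStyleUnbox(null);
        } catch (NullPointerException npe) {
            System.out.println("java-style unbox of null threw NullPointerException");
        }
    }
}
```

The first method never throws, so an exception handler around it has nothing to catch; only an explicit check such as `checkedUnbox` would notice the null in that path.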



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-08-31 Thread GitBox


xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r960070035


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##
@@ -252,28 +266,44 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
  """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+s"""
+   |try {
+   |  ${inputExpr.trim}
+   |} catch (NullPointerException npe) {
+   |  throw QueryExecutionErrors.valueCannotBeNullError(

Review Comment:
   Printing the single datum won't be helpful since it's always NULL, and it 
would be challenging to access the whole input row from this location. We 
create the projection recursively, so at this point while recursing, we don't 
even have a reference to the fully created projection to grab the other fields. 
Note also that this is a failure to project the data, and we would also need to 
project the data to print it, so we'd have to selectively skip this field.
   
   Open to suggestions, but I don't see a clear path forward.
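
The constraint described above can be pictured with a toy recursive generator, written in plain Java as a rough sketch rather than Spark's actual code. The `Field` class, the `writer.write(...)` and `.get(i)` snippets, and the simplified error message are all assumptions made for illustration; only the name `wrapWithNpeHandling` and the try-catch shape come from the diff. Field names are assumed to need no escaping here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NpeWrapSketch {
    /** Hypothetical schema node; not a Spark class. */
    static final class Field {
        final String name;
        final boolean nullable;
        final List<Field> children;

        Field(String name, boolean nullable, List<Field> children) {
            this.name = name;
            this.nullable = nullable;
            this.children = children;
        }
    }

    // Same shape as the helper in the diff, with the error call simplified.
    static String wrapWithNpeHandling(String inputExpr, List<String> descPath) {
        return "try {\n  " + inputExpr.trim() + "\n} catch (NullPointerException npe) {\n"
            + "  throw new RuntimeException(\"The value at " + String.join(".", descPath)
            + " cannot be null\");\n}\n";
    }

    // While recursing, only the accessor expression and the path so far are in
    // scope; there is no handle on the finished projection or the whole input row.
    static String generate(Field field, String accessor, List<String> parentPath) {
        List<String> path = new ArrayList<>(parentPath);
        path.add(field.name);
        if (field.children.isEmpty()) {
            String write = "writer.write(" + accessor + ");";
            return field.nullable ? write + "\n" : wrapWithNpeHandling(write, path);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < field.children.size(); i++) {
            sb.append(generate(field.children.get(i), accessor + ".get(" + i + ")", path));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Field c = new Field("c", false, Collections.<Field>emptyList());
        Field d = new Field("d", true, Collections.<Field>emptyList());
        Field b = new Field("b", false, Arrays.asList(c, d));
        System.out.print(generate(b, "row", Collections.<String>emptyList()));
    }
}
```

Because the handler is emitted at the leaf, the only context it can put in the message is the path that was threaded through the recursion, which is why the error names a location rather than printing the row.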






[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-08-25 Thread GitBox


xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r955234541


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:
##
@@ -252,28 +264,43 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
  """.stripMargin
   }
 
+  /**
+   * Wrap `inputExpr` in a try-catch block that will catch any [[NullPointerException]] that is
+   * thrown, instead throwing a (more helpful) error message as provided by
+   * [[org.apache.spark.sql.errors.QueryExecutionErrors.valueCannotBeNullError]].
+   */
+  private def wrapWithNpeHandling(inputExpr: String, descPath: Seq[String]): String =
+s"""
+   |try {
+   |  ${inputExpr.trim}
+   |} catch (NullPointerException npe) {
+   |  throw QueryExecutionErrors.valueCannotBeNullError("${descPath.mkString(".")}");

Review Comment:
   Good catch! I can't believe the ridiculous stuff Spark will accept as a 
valid column name. Fixed and added a test for this.
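
The thread does not show the fix itself, so the following is only a sketch of one way to handle the problem, under the assumption that the concern was column names containing characters such as `"` or `\` being interpolated directly into the generated Java source. `toJavaStringLiteral` and the class name are hypothetical and are not Spark APIs.

```java
final class LiteralEscapeSketch {
    // Hypothetical helper: turn an arbitrary column path into a Java string
    // literal that is safe to paste into generated source code.
    static String toJavaStringLiteral(String raw) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            switch (ch) {
                case '\\': sb.append("\\\\"); break; // backslash becomes \\
                case '"':  sb.append("\\\""); break; // quote becomes \"
                case '\n': sb.append("\\n"); break;  // newline becomes \n
                case '\r': sb.append("\\r"); break;  // carriage return becomes \r
                default:   sb.append(ch);
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // A column name containing a double quote no longer ends the literal early.
        System.out.println(toJavaStringLiteral("weird\"col"));
    }
}
```

With a helper like this, the template would interpolate the already-quoted literal instead of wrapping the raw `descPath.mkString(".")` result in quotes.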






[GitHub] [spark] xkrogen commented on a diff in pull request #37634: [SPARK-40199][SQL] Provide useful error when projecting a non-null column encounters null value

2022-08-25 Thread GitBox


xkrogen commented on code in PR #37634:
URL: https://github.com/apache/spark/pull/37634#discussion_r955182512


##
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##
@@ -447,6 +447,13 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase {
 new RuntimeException(fieldCannotBeNullMsg(index, fieldName))
   }
 
+  def valueCannotBeNullError(locationDesc: String): RuntimeException = {
+new RuntimeException(s"The value at $locationDesc cannot be null, but a NULL was found. " +
+  "This is typically caused by the presence of a NULL value when the schema indicates the " +
+  "value should be non-null. Check that the input data matches the schema and/or that UDFs " +

Review Comment:
   Yeah, it handles this situation. Marking a UDF as non-nullable only adjusts 
the schema; if the UDF returns null, the output row will still contain a null 
value, so the situation is identical to what is already tested in 
`GeneratedProjectionSuite`. But I can add this to `DataFrameSuite` to 
explicitly demonstrate that it is covered.


