[GitHub] [spark] rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s() URL: https://github.com/apache/spark/pull/27544#issuecomment-585498805 BTW, just want to mention that, before this PR, on Apache Spark master / branch-3.0, the following query would fail to compile the generated code, which is a regression. ```scala Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like p").queryExecution.debug.codegen ``` produces: ```java /* 041 */ java.util.regex.Pattern project_pattern_0 = java.util.regex.Pattern.compile( /* 042 */ org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(project_rightStr_0, '\')); ``` So if anybody has ever tried the non-constant pattern code path on master recently, it should have failed: ``` scala> Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like p").collect 20/02/13 01:05:30 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 84: Closing single quote missing org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 84: Closing single quote missing ``` That probably means we're still lacking test coverage for the regex expressions in Spark SQL. May be worth adding some tests before 3.0 goes out the door, e.g. DataFrame tests and/or SQL tests. (Don't know who to cc ... probably gonna be a long list? I'll cc the list of reviewers who already reviewed this PR: @viirya @maropu @kiszk @HyukjinKwon @cloud-fan and also @gatorsmile ) The reason why the RegexExpressionSuite unit test didn't catch the bug is because the codegen test framework isn't doing the right thing. https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L280-L294 `ExpressionEvalHelper.evaluateWithUnsafeProjection` was explicitly triggering codegen twice, with the intention that common-subexpression elimination (CSE) should NOT be active, and the expression should actually codegen twice to expose bugs. Yet what's actually happening is that CSE was triggered for these tests, which just so happens to allow the bug that this PR fixes dodge the bullet and slip through the tests: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L308-L315 In `GenerateUnsafeProjection`, there's a piece of logic that generates code from expressions, and then embeds the generated code into a `code"..."` block again, which will trigger this bug. But because CSE was active, the problematic code did not go through this code path, but instead got generated into a separate function which did not get wrapped into a `code"..."` again. That's how it (unfortunately) dodged the bullet. If the codegen tests either turned off CSE, or test with both CSE on and off, then this bug would have been exposed much earlier. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s() URL: https://github.com/apache/spark/pull/27544#issuecomment-585498805 BTW, just want to mention that, before this PR, on Apache Spark master / branch-3.0, the following query would fail to compile the generated code, which is a regression. ```scala Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like p").queryExecution.debug.codegen ``` produces: ```java /* 041 */ java.util.regex.Pattern project_pattern_0 = java.util.regex.Pattern.compile( /* 042 */ org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(project_rightStr_0, '\')); ``` So if anybody has ever tried the non-constant pattern code path on master recently, it should have failed: ``` scala> Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like p").collect 20/02/13 01:05:30 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 84: Closing single quote missing org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 84: Closing single quote missing ``` That probably means we're still lacking test coverage for the regex expressions in Spark SQL. May be worth adding some tests before 3.0 goes out the door, e.g. DataFrame tests and/or SQL tests. (Don't know who to cc ... probably gonna be a long list? I'll cc the list of reviewers who already reviewed this PR: @viirya @maropu @kiszk @HyukjinKwon @cloud-fan and also @gatorsmile ) The reason why the RegexExpressionSuite unit test didn't catch the bug is because the codegen test framework isn't doing the right thing. https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L280-L294 `ExpressionEvalHelper.evaluateWithUnsafeProjection` was explicitly triggering codegen twice, with the intention that common-subexpression elimination (CSE) should NOT be active, and the expression should actually codegen twice to expose bugs. Yet what's actually happening is that CSE was triggered for these tests, which just so happens to allow the bug that this PR fixes dodge the bullet and slip through the tests: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L308-L315 In `GenerateUnsafeProjection`, there's a piece of logic that generates code from expressions, and then embeds the generated code into a `code"..."` block again, which will trigger this bug. But because CSE was active, the problematic code did not go through this code path, but instead got generated into a separate function which did not get wrapped into a `code"..."` again. That's how it (unfortunately) dodged the bullet. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org