[GitHub] [spark] rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()

2020-02-12 Thread GitBox
rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL 
codegen's code() interpolator should treat escapes like Scala's 
StringContext.s()
URL: https://github.com/apache/spark/pull/27544#issuecomment-585498805
 
 
   BTW, just want to mention that, before this PR, on Apache Spark master / 
branch-3.0, the following query would fail to compile the generated code, which 
is a regression.
   
   ```scala
   Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like 
p").queryExecution.debug.codegen
   ```
   produces:
   ```java
   /* 041 */   java.util.regex.Pattern project_pattern_0 = 
java.util.regex.Pattern.compile(
   /* 042 */ 
org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(project_rightStr_0,
 '\'));
   ```
   So if anybody has ever tried the non-constant pattern code path on master 
recently, it should have failed:
   ```
   scala> Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like 
p").collect
   20/02/13 01:05:30 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, 
Column 84: Closing single quote missing
   org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
42, Column 84: Closing single quote missing
   ```
   
   That probably means we're still lacking test coverage for the regex 
expressions in Spark SQL. May be worth adding some tests before 3.0 goes out 
the door, e.g. DataFrame tests and/or SQL tests.
   
   (Don't know who to cc ... probably gonna be a long list? I'll cc the list of 
reviewers who already reviewed this PR:
   @viirya @maropu @kiszk @HyukjinKwon @cloud-fan and also @gatorsmile 
   )
   
   The reason why the RegexExpressionSuite unit test didn't catch the bug is 
because the codegen test framework isn't doing the right thing.
   
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L280-L294
   `ExpressionEvalHelper.evaluateWithUnsafeProjection` was explicitly 
triggering codegen twice, with the intention that common-subexpression 
elimination (CSE) should NOT be active, and the expression should actually 
codegen twice to expose bugs.
   
   Yet what's actually happening is that CSE was triggered for these tests, 
which just so happens to allow the bug that this PR fixes dodge the bullet and 
slip through the tests:
   
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L308-L315
   In `GenerateUnsafeProjection`, there's a piece of logic that generates code 
from expressions, and then embeds the generated code into a `code"..."` block 
again, which will trigger this bug.
   But because CSE was active, the problematic code did not go through this 
code path, but instead got generated into a separate function which did not get 
wrapped into a `code"..."` again. That's how it (unfortunately) dodged the 
bullet.
   
   If the codegen tests either turned off CSE, or test with both CSE on and 
off, then this bug would have been exposed much earlier.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()

2020-02-12 Thread GitBox
rednaxelafx edited a comment on issue #27544: [SPARK-30795][SQL] Spark SQL 
codegen's code() interpolator should treat escapes like Scala's 
StringContext.s()
URL: https://github.com/apache/spark/pull/27544#issuecomment-585498805
 
 
   BTW, just want to mention that, before this PR, on Apache Spark master / 
branch-3.0, the following query would fail to compile the generated code, which 
is a regression.
   
   ```scala
   Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like 
p").queryExecution.debug.codegen
   ```
   produces:
   ```java
   /* 041 */   java.util.regex.Pattern project_pattern_0 = 
java.util.regex.Pattern.compile(
   /* 042 */ 
org.apache.spark.sql.catalyst.util.StringUtils.escapeLikeRegex(project_rightStr_0,
 '\'));
   ```
   So if anybody has ever tried the non-constant pattern code path on master 
recently, it should have failed:
   ```
   scala> Seq(("abc", "%bc")).toDF("s", "p").repartition(1).selectExpr("s like 
p").collect
   20/02/13 01:05:30 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, 
Column 84: Closing single quote missing
   org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
42, Column 84: Closing single quote missing
   ```
   
   That probably means we're still lacking test coverage for the regex 
expressions in Spark SQL. May be worth adding some tests before 3.0 goes out 
the door, e.g. DataFrame tests and/or SQL tests.
   
   (Don't know who to cc ... probably gonna be a long list? I'll cc the list of 
reviewers who already reviewed this PR:
   @viirya @maropu @kiszk @HyukjinKwon @cloud-fan and also @gatorsmile 
   )
   
   The reason why the RegexExpressionSuite unit test didn't catch the bug is 
because the codegen test framework isn't doing the right thing.
   
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala#L280-L294
   `ExpressionEvalHelper.evaluateWithUnsafeProjection` was explicitly 
triggering codegen twice, with the intention that common-subexpression 
elimination (CSE) should NOT be active, and the expression should actually 
codegen twice to expose bugs.
   
   Yet what's actually happening is that CSE was triggered for these tests, 
which just so happens to allow the bug that this PR fixes dodge the bullet and 
slip through the tests:
   
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L308-L315
   In `GenerateUnsafeProjection`, there's a piece of logic that generates code 
from expressions, and then embeds the generated code into a `code"..."` block 
again, which will trigger this bug.
   But because CSE was active, the problematic code did not go through this 
code path, but instead got generated into a separate function which did not get 
wrapped into a `code"..."` again. That's how it (unfortunately) dodged the 
bullet.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org