I have upgraded my Spark job from Spark 3.3.1 to Spark 3.5.0. I am querying
a MySQL database and applying

`UPPER(col) = UPPER(value)` in the subsequent SQL query. It works as
expected in Spark 3.3.1, but not with 3.5.0.

Where condition: `UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')`

The `st` column is an ENUM in the database, and it is what causes the issue.
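
For context, a minimal sketch of how the read and the filter are wired up; the JDBC URL, credentials, and the table name `tickets` are placeholders, not the real job:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("mysql-enum-filter").getOrCreate()

    // Placeholder connection details; the real table holds the vn and st columns.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", "tickets")
      .option("user", "user")
      .option("password", "password")
      .load()
      .where("UPPER(vn) = 'ERICSSON' AND (UPPER(st) = 'OPEN' OR UPPER(st) = 'REOPEN' OR UPPER(st) = 'CLOSED')")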

Below is the physical plan of the *FILTER* phase:

For 3.3.1 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
(upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))

For 3.5.0 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = CLOSED)))

-----------------------------------------------------------------

I debugged it and found that Spark added a property in version 3.4.0,
**spark.sql.readSideCharPadding**, whose default value is **true**.

Link to the JIRA: https://issues.apache.org/jira/browse/SPARK-40697
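
Assuming an active SparkSession named `spark` (as in the sketch above), the current value of the flag can be checked at runtime:

    // Runtime SQL conf introduced in 3.4.0 by SPARK-40697; defaults to true.
    spark.conf.get("spark.sql.readSideCharPadding")   // "true" on 3.5.0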

The change also added a new method to the class **CharVarcharCodegenUtils**:

    public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
      int numChars = inputStr.numChars();
      if (numChars == limit) {
        return inputStr;
      } else if (numChars < limit) {
        // shorter values are right-padded with spaces up to the CHAR length
        return inputStr.rpad(limit, SPACE);
      } else {
        return inputStr;
      }
    }


**This method right-pads the ENUM values with whitespace while reading,
which causes the issue.**
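
A plain-Scala illustration of why this breaks the UPPER comparison, assuming `st` is read as CHAR(13) (the `13` in the plan above):

    // Simulate the read-side padding of a CHAR(13) column value.
    val fromDb  = "OPEN".padTo(13, ' ')   // "OPEN         "
    val literal = "OPEN"

    fromDb.toUpperCase == literal          // false: UPPER keeps the trailing spaces
    fromDb == literal.padTo(13, ' ')       // true: matches only when the literal is padded too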

-----------------------------------------------------------------------

When I remove the UPPER function from the where condition, the **FILTER**
phase looks like this:

 +- Filter (((staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
 StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
) OR (staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = REOPEN       )) OR
(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = CLOSED       ))


**You can see that white space has also been appended after the literal
values, so the comparison still matches and the query runs fine, giving the
correct result.**

But with the UPPER function, I do not get any data back.

------------------------------------------------------------------

I have also tried disabling this property, *spark.sql.readSideCharPadding =
false* (see the sketch after the two cases below), with the following cases:

1. With the UPPER function in the where clause:
   The filters are not pushed to the database and the *query works fine*.

  +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
(upper(st#42) = CLOSED))

2. But when I remove the UPPER function:

 *The filter is pushed to MySQL with the white spaces and I get no data
back. (THIS IS CAUSING A VERY BIG ISSUE.)*

  PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
*Or(Or(EqualTo(st,OPEN         ),EqualTo(st,REOPEN
)),EqualTo(st,CLOSED       ))]
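
For reference, this is roughly how the flag was disabled for these tests; it can equally be passed with `--conf` at submit time:

    // Disable read-side CHAR padding for the session.
    spark.conf.set("spark.sql.readSideCharPadding", "false")

    // or: spark-submit --conf spark.sql.readSideCharPadding=false ...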

I cannot move this filter into the JDBC read query, and I also cannot remove
the UPPER function from the where clause.
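
For completeness, the option I cannot use would be to embed the predicate in the JDBC read itself through a dbtable subquery, roughly like this (connection details and the table name are placeholders again):

    // Shown only to illustrate what "moving the filter to the JDBC read query"
    // would mean; it is not an option in this job.
    val pushed = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable",
        """(SELECT * FROM tickets
          |  WHERE UPPER(vn) = 'ERICSSON'
          |    AND UPPER(st) IN ('OPEN', 'REOPEN', 'CLOSED')) AS t""".stripMargin)
      .option("user", "user")
      .option("password", "password")
      .load()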

--------------------------------------------------------------------------------------------------------

I also found the same data getting written to Cassandra with the *padding*.
