Hello, I’m writing a custom Spark Catalyst Expression with custom codegen,
but it seems that Spark (3.0.0) doesn’t want to generate code and instead
falls back to interpreted mode.

I created my SparkSession with spark.sql.codegen.factoryMode=CODEGEN_ONLY
and spark.sql.codegen.fallback=false, hoping to force Spark to do code
generation.
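
For completeness, the session is set up roughly like this (a sketch; the
master and app name here are placeholders, only the two codegen settings
are the ones I actually rely on):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")                                       // placeholder
  .appName("codegen-only-test")                             // placeholder
  .config("spark.sql.codegen.factoryMode", "CODEGEN_ONLY")
  .config("spark.sql.codegen.fallback", "false")
  .getOrCreate()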

Then, I have this custom Expression with both the interpreted path and
codegen defined:

// Check whether a string has whitespace at either end
import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

case class IsTrimmedExpr(child: Expression)
  extends UnaryExpression with ExpectsInputTypes {

  override def inputTypes: Seq[DataType] = Seq(StringType)
  override lazy val dataType: DataType = BooleanType

  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    // Thrown on purpose so I can tell that the codegen path is taken
    throw new RuntimeException("expected code gen")
    // defineCodeGen assigns the generated boolean expression to ev.value
    defineCodeGen(ctx, ev, input => s"($input.trim().equals($input))")
  }

  override protected def nullSafeEval(input: Any): Any = {
    // Thrown on purpose: the interpreted path should never run
    throw new RuntimeException("should not eval")
    val str = input.asInstanceOf[UTF8String]
    str.trim.equals(str)
  }
}

which I register in the session’s function registry (I know this is an
undocumented API, but since the registry is able to find the expression, it
shouldn’t affect the execution stage, right?):

import org.apache.spark.sql.catalyst.FunctionIdentifier

spark.sessionState.functionRegistry.registerFunction(
  FunctionIdentifier("is_trimmed"), {
    case Seq(s) => IsTrimmedExpr(s)
  }
)

To invoke the function/Expression, I do

import spark.implicits._  // needed for toDF on a local Seq

val df = Seq("  abc", "def", "56 ", " 123 ", "what is a trim").toDF("word")
df.selectExpr("word", "is_trimmed(word)").show()

The expected behavior is to get the exception from doGenCode(), but instead
I get the exception from nullSafeEval(), which should not run at all.
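
As a side note, I believe the code produced by whole-stage codegen can be
inspected with the debug helpers, which should at least show whether any
code is generated for this plan. A sketch, using the same df as above:

import org.apache.spark.sql.execution.debug._

df.selectExpr("word", "is_trimmed(word)").debugCodegen()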

How do I force Spark to use codegen mode?

Best,
Han
