[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

Lily Hahn Mon, 29 Jan 2024 08:45:30 -0800

Hi,

I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into 
an issue with some of our queries that read from PostgreSQL databases.


Any attempt to run a Spark SQL query that selects a bpchar without a length 
specifier from the source DB seems to crash:
py4j.protocol.Py4JJavaError: An error occurred while calling 
o1061.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0) (192.168.1.48 executor driver): java.lang.OutOfMemoryError: Requested array 
size exceeds VM limit
        at org.apache.spark.unsafe.types.UTF8String.rpad(UTF8String.java:880)
        at 
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.readSidePadding(CharVarcharCodegenUtils.java:62)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
        at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at 
org.apache.spark.executor.Executor$TaskRunner$$Lambda$2882/0x000000080124d840.apply(Unknown
 Source)
        at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

This appears to be the plan step related to the traceback:
staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, 
StringType, readSidePadding, bpcharcolumn#7, 2147483647, true, false, true)

Reading from a subquery and casting the column to varchar, appears to work 
around the issue.

In PostgreSQL, the bpchar type acts as a variable unlimited length, 
blank-trimmed string if the length is omitted from the definition.

Is this an issue with Spark? I think the column is incorrectly getting 
interpreted as a char, which behaves the same way as a bpchar(n).

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

Reply via email to