Hi, 

I have an application that runs on a Spark 2.4.4 cluster. It transforms
two RDDs to DataFrames with `rdd.toDF()` and then writes them out to files.

To make better use of worker resources, the application runs the two jobs
on separate threads. The code snippet looks like this:



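A minimal sketch of that pattern, assuming local mode for brevity; the case class, paths, and data are illustrative placeholders, not the actual application code:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

// Illustrative record type; the real application has its own case classes.
case class Record(id: Int, name: String)

object TwoJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("two-jobs").getOrCreate()
    import spark.implicits._

    val rddA = spark.sparkContext.parallelize(Seq(Record(1, "a")))
    val rddB = spark.sparkContext.parallelize(Seq(Record(2, "b")))

    // Each job converts its RDD to a DataFrame and writes it out;
    // the two jobs run concurrently on separate threads.
    val jobA = Future { rddA.toDF().write.parquet("/tmp/out-a") }
    val jobB = Future { rddB.toDF().write.parquet("/tmp/out-b") }

    Await.result(jobA, Duration.Inf)
    Await.result(jobB, Duration.Inf)
    spark.stop()
  }
}
```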
I found that `toDF()` is not thread-safe: the application sometimes fails
with `java.lang.UnsupportedOperationException`. You can reproduce it with
the following code snippet (it happens roughly 1% of the time, and is more
likely to happen when the case class has a large number of fields):






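A self-contained sketch of such a reproduction; the class name `A` matches the error message, but the field count, thread count, and iteration count are illustrative guesses:

```scala
import org.apache.spark.sql.SparkSession

// A case class with many fields widens the race window inside
// ScalaReflection.schemaFor, making the failure easier to hit.
case class A(f1: Int, f2: Int, f3: Int, f4: Int, f5: Int, f6: Int,
             f7: Int, f8: Int, f9: Int, f10: Int, f11: Int, f12: Int)

object ToDfRace {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("toDF-race").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(
      Seq(A(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)))

    // Call toDF() repeatedly from several threads at once; with low
    // probability one of them throws
    // java.lang.UnsupportedOperationException.
    val threads = (1 to 4).map { _ =>
      new Thread(() => (1 to 100).foreach(_ => rdd.toDF()))
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    spark.stop()
  }
}
```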
You may get a similar exception message: `Schema for type A is not
supported`



From the error message, the failure comes from `ScalaReflection.schemaFor()`.
I looked at the code: Spark uses Scala reflection to derive the data type for
the case class, and as far as I know Scala reflection has known concurrency
issues, for example:

SPARK-26555 <https://issues.apache.org/jira/browse/SPARK-26555>
Thread safety in Scala reflection with type matching
<https://stackoverflow.com/questions/55150590/thread-safety-in-scala-reflection-with-type-matching>
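The StackOverflow question above boils down to concurrent type matching in scala-reflect. A minimal sketch of that access pattern, without Spark (whether it actually misbehaves depends on the Scala version):

```scala
import scala.reflect.runtime.universe._

// Exercise type matching (<:<) from several threads at once; this is
// the kind of concurrent runtime-reflection use SPARK-26555 discusses.
object ReflectRace {
  def isProduct[T: TypeTag]: Boolean = typeOf[T] <:< typeOf[Product]

  def main(args: Array[String]): Unit = {
    val threads = (1 to 4).map { _ =>
      new Thread(() => (1 to 1000).foreach(_ => isProduct[(Int, String)]))
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
```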

Should we fix it? I cannot find any documentation stating that creating a
DataFrame is not thread-safe.

I worked around this by adding a lock around the RDD-to-DataFrame conversion.
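A sketch of that workaround, assuming a hypothetical helper object that funnels every conversion through a single global lock (names are illustrative):

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: serialize all toDF() calls so only one thread
// runs the Scala-reflection-based schema derivation at a time.
object SafeToDF {
  private val lock = new Object

  def toDF[T <: Product: TypeTag](spark: SparkSession, rdd: RDD[T]): DataFrame =
    lock.synchronized {
      import spark.implicits._
      rdd.toDF()
    }
}
```

The lock only needs to cover the conversion itself; the subsequent write can still run concurrently on each thread.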





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
