Yeoul, I think a microbenchmark you could run for PySpark serialization/deserialization would be a withColumn plus a Python UDF that returns a constant, and then comparing that with similar code in Scala.
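Something along these lines (a minimal sketch only; the row count, column names, and the comparison against a built-in JVM expression are illustrative choices, and the Scala side would be the analogous spark-shell code with a Scala UDF):

import time
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10 * 1000 * 1000)   # single "id" column

# Python UDF returning a constant: every row is shipped to the Python
# worker, so the run time includes the ser/deser round trip plus the
# Python loop over the rows.
const_udf = F.udf(lambda x: 1, IntegerType())

start = time.time()
df.withColumn("c", const_udf(F.col("id"))).agg(F.sum("c")).collect()
print("python udf:   %.2fs" % (time.time() - start))

# Same logical work with a built-in expression: stays entirely in the
# JVM, so the gap between the two runs roughly bounds the Python
# round trip plus Python execution.
start = time.time()
df.withColumn("c", F.lit(1)).agg(F.sum("c")).collect()
print("built-in lit: %.2fs" % (time.time() - start))

Note that the gap you measure this way bundles serialization together with running the Python interpreter itself, which is exactly the limitation below.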
I am not sure there is a way to measure just the serialization code, because the PySpark API only lets you apply a Python function over the DataFrame, so that always involves running a for loop in Python over the data. You would probably need to do some hacking to make it do just the serialization; a rough sketch of timing the pickling by itself is at the end of this message. Maybe other people have more insights?

Li

On Tue, Mar 7, 2017 at 9:18 PM Yeoul Na <yeo...@uci.edu> wrote:

> Hi all,
>
> I am trying to analyze PySpark performance overhead. People just say PySpark
> is slower than Scala due to the serialization/deserialization overhead. I
> tried the example in this post:
> https://0x0fff.com/spark-dataframes-are-faster-arent-they/. This and many
> articles say the straightforward Python implementation is the slowest due to
> the serialization/deserialization overhead.
>
> However, when I actually looked at the log in the Web UI, the serialization
> and deserialization times for PySpark do not seem to be any bigger than
> those for Scala. The main contributor was "Executor Computing Time". Thus,
> we cannot be sure whether this is due to serialization or because Python
> code is simply slower than Scala code.
>
> So my question is: does "Task Deserialization Time" in the Spark Web UI
> actually include the serialization/deserialization time in PySpark? If not,
> how can I actually measure the serialization/deserialization overhead?
>
> Thanks,
> Yeoul
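For reference, the kind of hacking I had in mind: time a pickle round trip on plain Python rows outside Spark. This is only a rough sketch with made-up row shapes and counts, and it is not the exact serializer path PySpark uses (PySpark batches rows and uses its own serializer classes), so treat the numbers as a ballpark for the per-row serialization cost only.

import pickle
import time

# Synthetic rows standing in for whatever your DataFrame actually holds.
data = [(i, float(i), "row-%d" % i) for i in range(100000)]

start = time.time()
blobs = [pickle.dumps(t) for t in data]
ser = time.time() - start

start = time.time()
_ = [pickle.loads(b) for b in blobs]
deser = time.time() - start

print("pickle: %.3fs, unpickle: %.3fs for %d rows" % (ser, deser, len(data)))

Comparing that against the UDF-vs-built-in gap above gives at least a crude split between "moving the data to Python" and "running the Python function".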