Yeoul,

I think one way to run a microbenchmark for PySpark
serialization/deserialization would be to run a withColumn + a Python UDF
that returns a constant, and compare that with similar code in
Scala.
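
Something along these lines on the PySpark side (just a sketch; the data
size, the sum() used to force evaluation, and the lit(1) baseline are my
own choices, not anything official):

import time
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("pyspark-serde-bench").getOrCreate()
df = spark.range(10000000)  # 10M rows with a single 'id' column

# Python UDF returning a constant: every input row is still pickled,
# shipped to a Python worker, evaluated there, and the result pickled back.
const_udf = F.udf(lambda x: 1, IntegerType())

start = time.time()
df.withColumn("c", const_udf(df["id"])).agg(F.sum("c")).collect()  # forces the UDF to run
print("python udf:", time.time() - start)

# JVM-only baseline: lit(1) never leaves the JVM, so no Python serialization.
start = time.time()
df.withColumn("c", F.lit(1)).agg(F.sum("c")).collect()
print("lit(1):   ", time.time() - start)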

I am not sure there is a way to measure just the serialization code,
because the PySpark API only lets you apply a Python function over the
DataFrame, so that always involves running a for loop in Python over the
data. You would probably need to do some hacking to make it do just the
serialization.
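
One rough hack: collect a small sample to the driver and time plain pickle
on it. That is not what PySpark actually does internally (it batches rows
and uses its own serializers), so treat the numbers only as a ballpark:

import pickle
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = spark.range(100000).collect()   # small sample of Row objects on the driver

# Time serialization of the rows on the driver, outside Spark's own path.
start = time.time()
blobs = [pickle.dumps(r, protocol=2) for r in rows]
print("pickle dumps:", time.time() - start)

# Time deserialization of the same blobs.
start = time.time()
_ = [pickle.loads(b) for b in blobs]
print("pickle loads:", time.time() - start)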

Maybe other people have more insights?

Li


On Tue, Mar 7, 2017 at 9:18 PM Yeoul Na <yeo...@uci.edu> wrote:


Hi all,

I am trying to analyze PySpark's performance overhead. People just say
PySpark is slower than Scala due to the serialization/deserialization
overhead. I tried the example in this post:
https://0x0fff.com/spark-dataframes-are-faster-arent-they/. This and many
other articles say the straightforward Python implementation is the slowest
due to the serialization/deserialization overhead.
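
For reference (not the post's exact code), the comparison I mean is roughly
between an RDD pipeline that runs Python lambdas per record and a DataFrame
version that stays in the JVM:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).withColumn("k", F.col("id") % 10)

# "Straightforward" Python: every record passes through Python lambdas,
# so it must be serialized out to Python workers and back.
rdd_sums = (df.rdd
            .map(lambda r: (r.k, r.id))
            .reduceByKey(lambda a, b: a + b)
            .collect())

# DataFrame version: the same aggregation is planned and executed in the JVM.
df_sums = df.groupBy("k").agg(F.sum("id")).collect()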

However, when I actually looked at the logs in the Web UI, the
serialization and deserialization times for PySpark did not seem to be any
bigger than those for Scala. The main contributor was "Executor Computing
Time". Thus, we cannot be sure whether the overhead comes from
serialization or simply from Python code being slower than Scala code.

So my question is: does "Task Deserialization Time" in the Spark Web UI
actually include the serialization/deserialization time in PySpark? If it
does not, how can I actually measure the serialization/deserialization
overhead?

Thanks,
Yeoul



