Hey,
I just posted this question on Stack Overflow (http://stackoverflow.com/questions/41362314/pyspark-streaming-job-avoid-object-serialization) and decided to try my luck here as well :)


I'm writing a PySpark job, but I've run into some performance issues. Basically, all the job does is read events from Kafka and log the transformations it makes. The thing is, the transformation is computed by a method of an object, and that object is pretty heavy: it contains a graph and an inner cache that gets updated automatically as it processes RDDs. So when I write the following piece of code:

analyzer = ShortTextAnalyzer(root_dir)

logger.info("Start analyzing the documents from kafka")

# analyzer is captured in the closures below, so it gets serialized and
# shipped to the executors along with the tasks.
ssc.union(*streams) \
    .filter(lambda x: x[1] is not None) \
    .foreachRDD(lambda rdd: rdd.foreach(
        lambda record: analyzer.analyze_short_text_event(record[1])))



Spark serializes my analyzer, which takes a lot of time because of the graph, and because a fresh copy is sent to the executors each time, the cache is only relevant for that specific RDD.

If the job were written in Scala, I could have defined an object (a singleton) that exists once in every executor, and then my analyzer wouldn't have to be serialized each time.
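
To illustrate what I mean, here is a rough Python sketch of the behavior I'm after (the module layout, the import path and the get_analyzer helper are only illustrative, not something I have working):

from short_text_analyzer import ShortTextAnalyzer  # assumed import path for the class

# Hypothetical per-executor singleton: the heavy analyzer would be built
# lazily, once per Python worker process, instead of being pickled and
# shipped with every task.
_analyzer = None

def get_analyzer(root_dir):
    global _analyzer
    if _analyzer is None:
        _analyzer = ShortTextAnalyzer(root_dir)
    return _analyzer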

I've read in a post (http://www.spark.tc/deserialization-in-pyspark-storage/) that prior to PySpark 2.0, objects are always serialized. Does that mean I have no way to avoid the serialization?

I'd love to hear about a way to avoid serialization in PySpark, if one exists: having my object created once per executor, so it could skip the serialization step, save time, and actually have a working cache, roughly along the lines of the sketch below.
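
For example, something like this is what I'm hoping is possible (foreachPartition is just one way to express it, get_analyzer is the hypothetical helper from the sketch above, and I've folded the None filter into the loop for brevity):

def process_partition(records, root_dir):
    # Reuse the per-executor analyzer instead of capturing the heavy
    # object in the driver-side closure.
    analyzer = get_analyzer(root_dir)
    for record in records:
        if record[1] is not None:
            analyzer.analyze_short_text_event(record[1])

ssc.union(*streams).foreachRDD(
    lambda rdd: rdd.foreachPartition(
        lambda records: process_partition(records, root_dir)))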

Thanks in advance :)
Sidney Feiner   /  SW Developer
M: +972.528197720  /  Skype: sidney.feiner.startapp

