Hey, I just posted this question on Stack Overflow (link here<http://stackoverflow.com/questions/41362314/pyspark-streaming-job-avoid-object-serialization>) and decided to try my luck here as well :)
I'm writing a PySpark job but I got into some performance issues. Basically, all it does is read events from Kafka and logs the transformations made. Thing is, the transformation is calculated based on an object's function, and that object is pretty heavy as it contains a Graph and an inner-cache which gets automatically updated as it processes rdd's. So when I write the following piece of code: analyzer = ShortTextAnalyzer(root_dir) logger.info("Start analyzing the documents from kafka") ssc.union(*streams).filter(lambda x: x[1] != None).foreachRDD(lambda rdd: rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))) It serializes my analyzer which takes a lot of time because of the graph, and as it is copied to the executor, the cache is only relevant for that specific RDD. If the job was written in Scala, I could have written an Object which would exist in every executor and then my object wouldn't have to be serialized each time. I've read in a post (http://www.spark.tc/deserialization-in-pyspark-storage/) that prior to PySpark 2.0, objects are always serialized. So does that mean that I have no way to avoid the serialization? I'd love to hear about a way to avoid serialization in PySpark if it exists. To have my object created once for each executor and then it could avoid the serialization process, gain time and actually have a working cache system? Thanks in advance :) Sidney Feiner / SW Developer M: +972.528197720 / Skype: sidney.feiner.startapp [StartApp]<http://www.startapp.com/>