Thanks everybody, but I've found another way of doing it. Because I didn't actually need an instance of my class, I created a "static" class: all variables are initialized as class variables and all methods are class methods. Thanks a lot anyway, and I hope my answer will also help somebody one day ☺
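For anyone wondering what the "static" class approach looks like: a minimal sketch, reusing the `ShortTextAnalyzer` and `root_dir` names from the thread below. The class-variable value and the method body are placeholders, not the actual implementation. Because no instance is ever created, there is no object to pickle and ship to the executors; only the class reference travels inside the closure.

```python
# Sketch of the "static" class approach: all state lives on the class
# itself, so no instance ever needs to be serialized.
# The root_dir value and the method body are illustrative placeholders.
class ShortTextAnalyzer:
    root_dir = "/data/models"  # class variable, set once per process

    @classmethod
    def analyze_short_text_event(cls, text):
        # Placeholder analysis using class-level state only.
        return (cls.root_dir, len(text))

# Called directly on the class inside the Spark closure, e.g.:
#   rdd.foreach(lambda record: ShortTextAnalyzer.analyze_short_text_event(record[1]))
```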
Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp
[StartApp] <http://www.startapp.com/>

From: Holden Karau [mailto:hol...@pigscanfly.ca]
Sent: Thursday, December 29, 2016 8:54 PM
To: Chawla,Sumit <sumitkcha...@gmail.com>; Eike von Seggern <eike.segg...@sevenval.com>
Cc: Sidney Feiner <sidney.fei...@startapp.com>; user@spark.apache.org
Subject: Re: [PySpark - 1.6] - Avoid object serialization

Alternatively, using the broadcast functionality can also help with this.

On Thu, Dec 29, 2016 at 3:05 AM Eike von Seggern <eike.segg...@sevenval.com> wrote:

2016-12-28 20:17 GMT+01:00 Chawla,Sumit <sumitkcha...@gmail.com>:

Would this work for you?

    def processRDD(rdd):
        analyzer = ShortTextAnalyzer(root_dir)
        rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

    ssc.union(*streams).filter(lambda x: x[1] != None) \
        .foreachRDD(lambda rdd: processRDD(rdd))

I think this will still send one analyzer to every executor where RDD partitions are stored. Maybe you can work around this with `RDD.foreachPartition()`:

    def processRDD(rdd):
        def partition_func(records):
            analyzer = ShortTextAnalyzer(root_dir)
            for record in records:
                analyzer.analyze_short_text_event(record[1])
        rdd.foreachPartition(partition_func)

This will create one analyzer per partition and RDD.

Best
Eike