Thanks everybody, but I've found another way of doing it.
Because I didn't actually need an instance of my class, I created a
"static" class: all variables are initialized as class variables and all
methods are class methods, so there is no instance to serialize.
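
For reference, a minimal sketch of what I mean (the class and method names
follow this thread; the method body is just a placeholder, not my actual
code):

class ShortTextAnalyzer(object):
    # Everything lives at class level: no instance is ever created,
    # so there is no per-record object to pickle and ship around.
    root_dir = None

    @classmethod
    def init(cls, root_dir):
        cls.root_dir = root_dir

    @classmethod
    def analyze_short_text_event(cls, event):
        # ... actual analysis logic goes here ...
        pass

The workers then call the class method directly, e.g.:

rdd.foreach(lambda record: ShortTextAnalyzer.analyze_short_text_event(record[1]))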
Thanks a lot anyway, hope my answer will also help someone one day ☺

Sidney Feiner   /  SW Developer
M: +972.528197720  /  Skype: sidney.feiner.startapp

StartApp <http://www.startapp.com/>

From: Holden Karau [mailto:hol...@pigscanfly.ca]
Sent: Thursday, December 29, 2016 8:54 PM
To: Chawla,Sumit <sumitkcha...@gmail.com>; Eike von Seggern 
<eike.segg...@sevenval.com>
Cc: Sidney Feiner <sidney.fei...@startapp.com>; user@spark.apache.org
Subject: Re: [PySpark - 1.6] - Avoid object serialization

Alternatively, using the broadcast functionality can also help with this.
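
A minimal sketch of the broadcast approach, assuming the analyzer itself can
be pickled once up front (the names mirror the snippets below):

analyzer_bc = sc.broadcast(ShortTextAnalyzer(root_dir))

def process_rdd(rdd):
    # Each executor fetches the broadcast value once and caches it,
    # instead of receiving a fresh copy inside every task closure.
    rdd.foreach(
        lambda record: analyzer_bc.value.analyze_short_text_event(record[1]))

ssc.union(*streams).filter(lambda x: x[1] is not None).foreachRDD(process_rdd)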

On Thu, Dec 29, 2016 at 3:05 AM Eike von Seggern
<eike.segg...@sevenval.com> wrote:
2016-12-28 20:17 GMT+01:00 Chawla,Sumit <sumitkcha...@gmail.com>:
Would this work for you?

def processRDD(rdd):
    analyzer = ShortTextAnalyzer(root_dir)
    rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

ssc.union(*streams).filter(lambda x: x[1] is not None) \
    .foreachRDD(processRDD)

I think this will still serialize the analyzer and send it to every executor
that holds partitions of the RDD.

Maybe you can work around this with `RDD.foreachPartition()`:

def processRDD(rdd):
    def partition_func(records):
        # The analyzer is constructed on the executor, once per partition,
        # so the object itself never has to be pickled on the driver.
        analyzer = ShortTextAnalyzer(root_dir)
        for record in records:
            analyzer.analyze_short_text_event(record[1])
    rdd.foreachPartition(partition_func)

This will create one analyzer per partition of each RDD.

Best

Eike
