Alternatively, using the broadcast functionality can also help with this. On Thu, Dec 29, 2016 at 3:05 AM Eike von Seggern <eike.segg...@sevenval.com> wrote:
> 2016-12-28 20:17 GMT+01:00 Chawla,Sumit <sumitkcha...@gmail.com>: > > Would this work for you? > > def processRDD(rdd): > analyzer = ShortTextAnalyzer(root_dir) > rdd.foreach(lambda record: > analyzer.analyze_short_text_event(record[1])) > > ssc.union(*streams).filter(lambda x: x[1] != None) > .foreachRDD(lambda rdd: processRDD(rdd)) > > > I think, this will still send each analyzer to all executors where rdd > partitions are stored. > > Maybe you can work around this with `RDD.foreachPartition()`: > > def processRDD(rdd): > def partition_func(records): > analyzer = ShortTextAnalyzer(root_dir) > for record in records: > analyzer.analyze_short_text_event(record[1]) > rdd.foreachPartition(partition_func) > > This will create one analyzer per partition and RDD. > > Best > > Eike >