RK Aduri,

Another idea is to union all the per-id results and then run a single collect. The question is how big the collected data would be.
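A minimal sketch of that idea, assuming the per-id lookup can be expressed as a filter on a "uid" column (the names largeDF, uids and the column itself are illustrative assumptions, not from your post):

    import org.apache.spark.sql.{DataFrame, Row}

    // Build one filtered plan per id, combine them with unionAll
    // (the Spark 1.6 name for union), and pay for a single collect
    // instead of one collect action per id.
    def lookupAll(largeDF: DataFrame, uids: Seq[String]): Array[Row] = {
      val perId: Seq[DataFrame] =
        uids.map(id => largeDF.filter(largeDF("uid") === id))
      val unioned = perId.reduce(_ unionAll _)
      unioned.collect() // one job, one driver round-trip
    }

Whether this beats the threaded version depends on how many rows survive the filters: collect pulls everything to the driver in one shot.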
2016-07-20 20:32 GMT+02:00 RK Aduri <rkad...@collectivei.com>:
> Spark version: 1.6.0
>
> So, here is the background:
>
> I have a data frame (Large_Row_DataFrame) which I created from an array
> of row objects. I also have an array of unique ids (U_ID) which I use to
> look up rows in the Large_Row_DataFrame (which is cached) and apply a
> customized function.
> For each unique id, I do a collect on the cached dataframe
> Large_Row_DataFrame. This means there is a bunch of 'collect' actions
> which Spark has to run. Since I execute them in a loop over the unique
> ids (U_ID), all these collect actions run sequentially.
>
> Solution that I implemented:
>
> To avoid the sequential wait on each collect, I created a few subsets of
> unique ids of a specific size and ran one thread per subset. Each thread
> is a Spark job that runs the collects in sequence, but only for its own
> subset, and there are as many threads as subsets. Surprisingly, the
> resulting run time is better than with the earlier sequential approach.
>
> Now the question:
>
> Is multithreading a correct approach to this solution, or could there be
> a better way of doing it?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MultiThreading-in-Spark-1-6-0-tp27374.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Maciek Bryński
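For reference, the threaded approach described in the quoted message could be sketched roughly as follows, again assuming the lookup is a filter on an illustrative "uid" column and using Scala Futures to submit the per-subset jobs; SparkContext is thread-safe, so jobs submitted from separate threads run concurrently:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.sql.{DataFrame, Row}

    // Each Future runs its collects sequentially for one subset of ids;
    // the futures themselves (and hence the Spark jobs they submit)
    // run concurrently, bounded by the execution context's thread pool.
    def lookupConcurrently(largeDF: DataFrame,
                           subsets: Seq[Seq[String]]): Seq[Seq[Row]] = {
      val jobs: Seq[Future[Seq[Row]]] = subsets.map { subset =>
        Future {
          subset.flatMap(id =>
            largeDF.filter(largeDF("uid") === id).collect())
        }
      }
      jobs.map(f => Await.result(f, Duration.Inf))
    }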