RK Aduri,
Another idea is to union all the results and then run a single collect.
The question is how big the collected data would be.
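
Roughly, a sketch (perIdResult and uniqueIds are placeholders for your
per-id computation and the id list; unionAll is the DataFrame union in 1.6.x):

    import org.apache.spark.sql.{DataFrame, Row}

    // Build one DataFrame per unique id, union them all, then collect once.
    val resultDFs: Seq[DataFrame] = uniqueIds.map(id => perIdResult(id))
    val combined: DataFrame = resultDFs.reduce(_ unionAll _)
    // A single action instead of one collect per id.
    val allRows: Array[Row] = combined.collect()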

2016-07-20 20:32 GMT+02:00 RK Aduri <rkad...@collectivei.com>:
> Spark version: 1.6.0
> So, here is the background:
>
>         I have a DataFrame (Large_Row_DataFrame) which I created from an
> array of Row objects, and I also have an array of unique ids (U_ID) which
> I'm going to use to look up rows in the Large_Row_DataFrame (which is
> cached) and apply a customized function.
>        For each lookup by unique id, I do a collect on the cached
> DataFrame Large_Row_DataFrame. This means there is a bunch of
> ‘collect’ actions that Spark has to run. Since I’m executing this in a loop
> over the unique ids (U_ID), all such collect actions run
> sequentially.
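
In code, the sequential approach described above would look roughly like this
(a sketch only; the "u_id" column, uIds and customFunction are assumed names,
the per-id collect in a loop is from the description):

    // Sequential version: one Spark action (collect) per unique id.
    largeRowDF.cache()
    uIds.foreach { id =>
      // Filter the cached DataFrame to this id and pull the rows to the driver.
      val rows = largeRowDF.filter(largeRowDF("u_id") === id).collect()
      // The custom function runs on the driver before the next id starts.
      customFunction(id, rows)
    }
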
>
> Solution that I implemented:
>
> To avoid waiting on each collect sequentially, I created a few subsets of
> unique ids of a specific size and run a thread per subset. Each thread
> submits the Spark jobs that run the collects in sequence for its own subset
> only, so there are as many threads as subsets, each handling one subset.
> Surprisingly, the resulting run time is better than the earlier sequential
> approach.
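
The threaded variant, as described, would look roughly like this (again a
sketch; subsets, largeRowDF, the "u_id" column and customFunction are assumed
names; the SparkContext is thread-safe, so jobs submitted from different
threads can run concurrently):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // One future (thread) per subset of ids; each future submits its own
    // sequence of collect jobs, so the subsets proceed in parallel.
    val futures = subsets.map { subset =>
      Future {
        subset.foreach { id =>
          val rows = largeRowDF.filter(largeRowDF("u_id") === id).collect()
          customFunction(id, rows)
        }
      }
    }
    Await.result(Future.sequence(futures), Duration.Inf)
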
>
> Now the question:
>
>         Is multithreading a correct approach towards the solution, or could
> there be a better way of doing this?
>
>
>
>



-- 
Maciek Bryński
