Is there a check you can put in place to avoid creating pairs that aren't in
your set of 20M pairs? Additionally, once you have converted your arrays to
pairs, you can use aggregateByKey with each pair as the key.
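To make that concrete, here is a rough, untested sketch in Java of what I have
in mind, assuming the ~20M known pairs fit in memory as a set you can broadcast
(the names, e.g. validPairs, are placeholders, not anything from your code).
One caveat: the Java API changed between releases, so flatMapToPair expects an
Iterator in Spark 2.x but an Iterable in 1.x.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class PairCounts {

  public static JavaPairRDD<Tuple2<String, String>, Integer> countPairs(
      JavaSparkContext sc,
      JavaRDD<String[]> arrays,                      // your 72,000 word arrays
      Set<Tuple2<String, String>> validPairs) {      // the ~20M known pairs

    // Ship the set of known pairs to each executor once, instead of
    // capturing it inside every task's closure.
    Broadcast<Set<Tuple2<String, String>>> valid = sc.broadcast(validPairs);

    return arrays
        // Expand each array into its word pairs, dropping pairs that are
        // not in the known set (the "check" mentioned above).
        .flatMapToPair(words -> {
          List<Tuple2<Tuple2<String, String>, Integer>> out = new ArrayList<>();
          for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j < words.length; j++) {
              Tuple2<String, String> pair = new Tuple2<>(words[i], words[j]);
              if (valid.value().contains(pair)) {
                out.add(new Tuple2<>(pair, 1));
              }
            }
          }
          return out.iterator();   // in Spark 1.x, return out; (Iterable)
        })
        // Count occurrences per pair. aggregateByKey (like reduceByKey)
        // combines partial counts on the map side, so the ~230M raw pairs
        // are never all held in memory at once.
        .aggregateByKey(0, (acc, v) -> acc + v, (a, b) -> a + b);
  }
}

For a plain count like this, reduceByKey((a, b) -> a + b) does the same thing
with less ceremony; aggregateByKey only matters if you want a different
accumulator type than the values.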
On Feb 20, 2015 1:57 PM, "shlomib" <shl...@summerhq.com> wrote:

> Hi,
>
> I am new to Spark and I think I missed something very basic.
>
> I have the following use case (I use Java and run Spark locally on my
> laptop):
>
>
> I have a JavaRDD<String[]>
>
> - The RDD contains around 72,000 arrays of strings (String[])
>
> - Each array contains 80 words (on average).
>
>
> What I want to do is to convert each array into a new array/list of pairs,
> for example:
>
> Input: String[] words = ['a', 'b', 'c']
>
> Output: List[<String, String>] pairs = [('a', 'b'), ('a', 'c'), ('b', 'c')]
>
> and then I want to count the number of times each pair appeared, so my
> final
> output should be something like:
>
> Output: List[<String, String, Integer>] result = [('a', 'b', 3), ('a', 'c',
> 8), ('b', 'c', 10)]
>
>
> The problem:
>
> Since each array contains around 80 words, it produces around 3,200 pairs, so
> after “mapping” my entire RDD I get 3,200 * 72,000 = *230,400,000* pairs to
> reduce, which requires way too much memory.
>
> (I know I have only around *20,000,000* unique pairs!)
>
> I already modified my code and used 'mapPartitions' instead of 'map'. It
> definitely improved the performance, but I still feel I'm doing something
> completely wrong.
>
>
> I was wondering if this is the right 'Spark way' to solve this kind of
> problem, or whether I should instead split my original RDD into smaller
> parts (using randomSplit), iterate over each part, aggregate the results
> into a result RDD (using 'union'), and move on to the next part.
>
>
> Can anyone please explain to me which solution is better?
>
>
> Thank you very much,
>
> Shlomi.
>
>
>
>
