We ended up going with:

  map() - set the group_id as the key in a Tuple
  reduceByKey() - end up with (K,Seq[V])
  map() - create the Bloom filter, loop through the Seq, and persist the Bloom filter

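For anyone following along, here is a rough sketch of the three steps above in plain Python, without Spark, just to show the shape of the logic. Everything in it is an assumption for illustration: the BloomFilter class is a toy, the record layout (a dict with "group_id" and "value") is made up, and the "persisted" dict stands in for wherever the filters actually get written. In Spark, the grouping in step 2 would be done by the shuffle rather than by a local dict.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Toy Bloom filter for illustration only (hypothetical, not production code)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def build_filters(records):
    # Step 1 (map): emit (group_id, value) pairs.
    keyed = ((rec["group_id"], rec["value"]) for rec in records)

    # Step 2 (group): collect all values for each key, i.e. (K, Seq[V]).
    grouped = defaultdict(list)
    for key, value in keyed:
        grouped[key].append(value)

    # Step 3 (map): build one filter per key by looping over its values,
    # then "persist" it. The dict is a stand-in for real storage.
    persisted = {}
    for key, values in grouped.items():
        bloom = BloomFilter()
        for value in values:
            bloom.add(value)
        persisted[key] = bloom
    return persisted


# Hypothetical sample data: two groups.
records = [
    {"group_id": "g1", "value": "alice"},
    {"group_id": "g1", "value": "bob"},
    {"group_id": "g2", "value": "carol"},
]
filters = build_filters(records)
```

Because the key travels with the grouped values into step 3, the code that builds the filter always knows which group it belongs to. A Bloom filter never gives false negatives, so "alice" in filters["g1"] is guaranteed to be True. A lookup for a value that was never added is usually False but can be a false positive.
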
This seems to be fine. I guess Spark cannot optimize the reduceByKey and map steps to occur together, since the fact that we are looping through the Seq is out of Spark's control.

-Suren

On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman wrote:

> Hi,
>
> My team is trying to replicate an existing Map/Reduce process in Spark.
>
> Basically, we are creating Bloom Filters for quick set membership tests
> within our processing pipeline.
>
> We have a single column (call it group_id) that we use to partition into
> sets.
>
> As you would expect, in the map phase, we emit the group_id as the key and
> in the reduce phase, we instantiate the Bloom Filter for a given key in the
> setup() method and persist that Bloom Filter in the cleanup() method.
>
> In Spark, we can do something similar with map() and reduceByKey() but we
> have the following questions.
>
>
> 1. Accessing the reduce key
> In reduceByKey(), how do we get access to the specific key within the
> reduce function?
>
>
> 2. Equivalent of setup/cleanup
> Where should we instantiate and persist each Bloom Filter by key? In the
> driver and then pass in the references to the reduce function? But if so,
> how does the reduce function know which set's Bloom Filter it should be
> writing to (question 1 above)?
>
> It seems if we use groupByKey and then reduceByKey, that gives us access
> to all of the values at one go. I assume there, Spark will manage if those
> values all don't fit in memory in one go.
>
>
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io
