We ended up going with:

map() - set the group_id as the key in a tuple
reduceByKey() - end up with (K, Seq[V])
map() - create the Bloom filter, loop through the Seq to populate it, and
persist the Bloom filter

This seems to be fine.
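
For anyone who wants to try the shape of this without a cluster, here is a
minimal sketch of the three steps on plain Scala collections instead of RDDs.
Everything named here (Record, SimpleBloomFilter, BloomPerGroup, the 1024-bit
size, the 3 hashes) is a made-up placeholder, not our production code, and the
persist step is left as a comment. Note that the function passed to
reduceByKey() has the type (V, V) => V, so getting (K, Seq[V]) out of it
assumes the values are already wrapped in a Seq; the sketch simply uses the
collections' groupBy for that step.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hypothetical input record: one member belonging to one group.
final case class Record(groupId: String, member: String)

// Hypothetical, deliberately tiny Bloom filter (not a tuned implementation).
final class SimpleBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new mutable.BitSet(numBits)

  // Map an item to numHashes bit positions using differently seeded hashes.
  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = MurmurHash3.stringHash(item, seed)
      ((h % numBits) + numBits) % numBits // keep the index non-negative
    }

  def add(item: String): Unit = positions(item).foreach(bits += _)

  // May return a false positive, never a false negative.
  def mightContain(item: String): Boolean =
    positions(item).forall(bits.contains)
}

object BloomPerGroup {
  // Hypothetical sample data.
  val rows: Seq[Record] = Seq(
    Record("g1", "alice"), Record("g2", "carol"), Record("g1", "bob"))

  def buildFilters(rows: Seq[Record]): Map[String, SimpleBloomFilter] = {
    // Step 1: set the group_id as the key in a tuple.
    val keyed: Seq[(String, String)] = rows.map(r => (r.groupId, r.member))

    // Step 2: end up with (K, Seq[V]).
    val grouped: Map[String, Seq[String]] =
      keyed.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

    // Step 3: one Bloom filter per key, filled by looping through the Seq.
    grouped.map { case (groupId, members) =>
      val filter = new SimpleBloomFilter(numBits = 1024, numHashes = 3)
      members.foreach(filter.add)
      // The filter would be persisted here (storage is not shown).
      (groupId, filter)
    }
  }
}
```

Because the key travels with the grouped values, the last step knows which
group's filter it is building, which is what the setup()/cleanup() pair gave
us in the Map/Reduce version.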

I guess Spark cannot combine the reduceByKey and map steps into one, since
the loop over the Seq happens inside our own function and is out of Spark's
control.
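
One possible way around that, which we have not tried: if each value is first
turned into a one-element filter, then combining two filters is just a union
of their bits. That merge is associative and has the (V, V) => V shape that
reduceByKey() expects, so no Seq is needed. A minimal sketch on plain Scala
collections follows; MergeableBloom, NumBits and NumHashes are hypothetical
names and sizes, and whether this is actually cheaper on a real cluster is an
open question.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object MergeableBloom {
  // Hypothetical sizes, chosen only for illustration.
  val NumBits = 1024
  val NumHashes = 3

  // Map an item to NumHashes bit positions using differently seeded hashes.
  def positions(item: String): Seq[Int] =
    (0 until NumHashes).map { seed =>
      val h = MurmurHash3.stringHash(item, seed)
      ((h % NumBits) + NumBits) % NumBits // keep the index non-negative
    }

  // A filter that contains exactly one item.
  def single(item: String): BitSet = BitSet(positions(item): _*)

  // Associative, commutative merge: the union of the set bits.
  def merge(a: BitSet, b: BitSet): BitSet = a | b

  def mightContain(filter: BitSet, item: String): Boolean =
    positions(item).forall(filter.contains)

  // Stand-in for map() followed by reduceByKey(merge) on a pair RDD.
  def buildFilters(rows: Seq[(String, String)]): Map[String, BitSet] =
    rows
      .map { case (groupId, member) => (groupId, single(member)) }
      .groupBy(_._1)
      .map { case (groupId, kvs) => (groupId, kvs.map(_._2).reduce(merge)) }
}
```

The groupBy here only emulates the per-key grouping that reduceByKey() would
do internally; the part that matters is that merge never needs to see all of
a key's values at once.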

-Suren




On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman <
[email protected]> wrote:

> Hi,
>
> My team is trying to replicate an existing Map/Reduce process in Spark.
>
> Basically, we are creating Bloom Filters for quick set membership tests
> within our processing pipeline.
>
> We have a single column (call it group_id) that we use to partition into
> sets.
>
> As you would expect, in the map phase, we emit the group_id as the key and
> in the reduce phase, we instantiate the Bloom Filter for a given key in the
> setup() method and persist that Bloom Filter in the cleanup() method.
>
> In Spark, we can do something similar with map() and reduceByKey() but we
> have the following questions.
>
>
> 1. Accessing the reduce key
> In reduceByKey(), how do we get access to the specific key within the
> reduce function?
>
>
> 2. Equivalent of setup/cleanup
> Where should we instantiate and persist each Bloom Filter by key? In the
> driver and then pass in the references to the reduce function? But if so,
> how does the reduce function know which set's Bloom Filter it should be
> writing to (question 1 above)?
>
> It seems that if we use groupByKey and then reduceByKey, we get access to
> all of the values at once. I assume that, in that case, Spark will manage
> things if those values do not all fit in memory at once.
>
>
>
> SUREN HIRAMAN, VP TECHNOLOGY
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR
> NEW YORK, NY 10001
> O: (917) 525-2466 ext. 105
> F: 646.349.4063
> E: suren.hiraman@velos.io
> W: www.velos.io
>
>
>


