Hi,

My team is trying to replicate an existing Map/Reduce process in Spark.

Basically, we are creating Bloom Filters for quick set membership tests
within our processing pipeline.

We have a single column (call it group_id) that we use to partition the
records into sets.

As you would expect, in the map phase we emit the group_id as the key; in
the reduce phase, we instantiate the Bloom Filter for a given key in the
reducer's setup() method and persist it in the cleanup() method.
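For concreteness, the per-key lifecycle we are replicating looks roughly
like the following. This is a toy, Spark-free Python sketch; the
ToyBloomFilter class, its sizes, and the sample records are made up for
illustration, not our real implementation.

```python
import hashlib

class ToyBloomFilter:
    """Tiny illustrative Bloom filter: m bits, k hash functions."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from salted MD5 digests of the item.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# One filter per group_id, mirroring the reducer's lifecycle:
records = [("g1", "a"), ("g1", "b"), ("g2", "c")]
filters = {}
for group_id, value in records:
    if group_id not in filters:          # "setup()" for this key
        filters[group_id] = ToyBloomFilter()
    filters[group_id].add(value)         # one "reduce()" step
# "cleanup()": here we would persist each filters[group_id]
```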

In Spark, we can do something similar with map() and reduceByKey(), but we
have the following questions.


1. Accessing the reduce key
In reduceByKey(), how do we get access to the specific key within the
reduce function?


2. Equivalent of setup/cleanup
Where should we instantiate and persist each per-key Bloom Filter? In the
driver, passing the references into the reduce function? If so, how does
the reduce function know which set's Bloom Filter it should write to
(question 1 above)?
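To make the question concrete, the behavior we are trying to express is
roughly the following. This is plain Python standing in for
aggregateByKey-style semantics, not real Spark code: the aggregate_by_key
function and its parameter names are ours, set() stands in for an empty
Bloom Filter, and everything is pretended to be a single partition.

```python
def aggregate_by_key(pairs, zero_factory, seq_op, comb_op):
    """Plain-Python stand-in for a per-key aggregation: fold each
    key's values into a fresh accumulator created just for that key."""
    acc_by_key = {}  # pretend everything is one partition
    for key, value in pairs:
        acc = acc_by_key.setdefault(key, zero_factory())
        acc_by_key[key] = seq_op(acc, value)
    # Across real partitions, comb_op would merge partial accumulators.
    return acc_by_key

records = [("g1", "a"), ("g1", "b"), ("g2", "c")]
# set() stands in for an empty Bloom Filter; add() for insertion.
filters = aggregate_by_key(
    records,
    zero_factory=set,                     # per-key "setup()"
    seq_op=lambda f, v: (f.add(v) or f),  # fold one value in
    comb_op=lambda f1, f2: f1 | f2,       # merge partial filters
)
# Note the key is only available here, outside the merge function:
for group_id, bloom in filters.items():
    pass  # persist bloom under group_id ("cleanup()")
```

In this shape, the merge function itself never needs the key; the key
reappears in the resulting (key, filter) pairs, which is where we would
persist each filter.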

It seems that if we use groupByKey and then map over each key's values,
that gives us access to all of the values in one go. I assume that, in
that case, Spark will manage things if a key's values don't all fit in
memory at once.
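The groupByKey variant we have in mind would look roughly like this
(again plain Python, not Spark: itertools.groupby over sorted records
stands in for the shuffle, and set() again stands in for a Bloom Filter):

```python
from itertools import groupby
from operator import itemgetter

records = [("g1", "a"), ("g1", "b"), ("g2", "c")]

# groupByKey-style: materialize all values for a key, then build the
# whole filter in one go.
filters = {}
for group_id, group in groupby(sorted(records), key=itemgetter(0)):
    bloom = set()                 # per-key "setup()"
    for _, value in group:
        bloom.add(value)          # consume the key's values together
    filters[group_id] = bloom     # per-key "cleanup()"
```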



SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
