>>>>> to occur together since the fact that we are looping through the Seq is
>>>>> out of Spark's control.
>>>>>
>>>>> -Suren
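The per-key pattern under discussion (group by key, then build and persist one Bloom Filter per key while looping through that key's values) can be sketched with plain Scala collections standing in for an RDD. Everything below is a hypothetical stand-in, not Spark or Hadoop API: `ToyBloomFilter` is a single-hash toy, and the `groupBy` call mimics the `(key, values)` shape that `rdd.groupByKey()` would produce.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a real Bloom filter library; it uses a single
// hash, so it only illustrates the add/mightContain shape, not real
// false-positive behavior.
class ToyBloomFilter(numBits: Int) {
  private val bits = new mutable.BitSet(numBits)
  private def slot(v: String): Int = math.abs(v.hashCode) % numBits
  def add(v: String): Unit = bits += slot(v)
  def mightContain(v: String): Boolean = bits(slot(v))
}

object PerKeyBloomSketch {
  // groupByKey-style grouping on plain collections: Map[group_id, filter].
  def buildFilters(records: Seq[(String, String)]): Map[String, ToyBloomFilter] =
    records.groupBy(_._1).map { case (groupId, kvs) =>
      val filter = new ToyBloomFilter(1024)        // the "setup()" step, per key
      kvs.foreach { case (_, v) => filter.add(v) } // the loop Spark does not control
      (groupId, filter)                            // the "cleanup()" step: hand back / persist
    }

  def main(args: Array[String]): Unit = {
    val filters = buildFilters(Seq(("g1", "a"), ("g1", "b"), ("g2", "c")))
    println(filters("g1").mightContain("a")) // true
    println(filters("g2").mightContain("a")) // false here, though a real filter can false-positive
  }
}
```

On a real RDD the equivalent would be a groupByKey followed by a map over each `(key, values)` pair, which is also what makes the key visible next to its values; inside a plain reduceByKey function, only the two values being combined are in scope.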
>>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman <
>>>> suren.hira...@velos.io> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My team is trying to replicate an existing Map/Reduce process in Spark.
>>>>>
>>>>> Basically, we are creating Bloom Filters for quick set membership tests
>>>>> within our processing pipeline.
>>>>>
>>>>> We have a single column (call it group_id) that we use to partition into
>>>>> sets.
>>>>>
>>>>> As you would expect, in the map phase, we emit the group_id as the key,
>>>>> and in the reduce phase, we instantiate the Bloom Filter for a given key
>>>>> in the setup() method and persist that Bloom Filter in the cleanup()
>>>>> method.
>>>>>
>>>>> In Spark, we can do something similar with map() and reduceByKey(), but
>>>>> we have the following questions.
>>>>>
>>>>> 1. Accessing the reduce key
>>>>> In reduceByKey(), how do we get access to the specific key within the
>>>>> reduce function?
>>>>>
>>>>> 2. Equivalent of setup/cleanup
>>>>> Where should we instantiate and persist each Bloom Filter by key? In the
>>>>> driver and then pass in the references to the reduce function? But if
>>>>> so, how does the