Grouped by the group_id but not sorted.

-Suren



On Thu, Mar 20, 2014 at 5:52 PM, Mayur Rustagi <[email protected]> wrote:

> You are using the data grouped (sorted?) to create the bloom filter?
> On Mar 20, 2014 4:35 PM, "Surendranauth Hiraman" <[email protected]>
> wrote:
>
>> Mayur,
>>
>> To be a little clearer, for creating the Bloom Filters, I don't think
>> broadcast variables are the way to go, though they would definitely work
>> for using the Bloom Filters to filter data.
>>
>> The reason is that the creation needs to happen in a single thread.
>> Otherwise, some type of locking/distributed locking is needed on the
>> individual Bloom Filter itself, which would hurt performance.
>>
>> Agreed?
>>
>> -Suren
>>
>>
>>
>>
>> On Thu, Mar 20, 2014 at 3:40 PM, Surendranauth Hiraman <
>> [email protected]> wrote:
>>
>>> Mayur,
>>>
>>> Thanks. This step is actually for creating the Bloom Filter, not for
>>> using it to filter data. But your answer still stands.
>>>
>>> Partitioning by key, having the bloom filters as a broadcast variable,
>>> and then doing mapPartitions makes sense.
>>>
>>> Are there performance implications for this approach, such as from using
>>> the broadcast variable, versus the approach we used, in which the Bloom
>>> Filter (again, for creating it) is only referenced by the single map
>>> application?
>>>
>>> -Suren
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Mar 20, 2014 at 3:20 PM, Mayur Rustagi
>>> <[email protected]> wrote:
>>>
>>>> Why are you trying to reduceByKey? Are you looking to work on the data
>>>> sequentially?
>>>> If I understand correctly, you are looking to filter your data using the
>>>> bloom filter, & each bloom filter is tied to the key that instantiated it.
>>>> Following are some of the options.
>>>> *Partition* your data by key & use the mapPartitions operator to run a
>>>> function on each partition independently. The same function will be
>>>> applied to each partition.
>>>> If your bloom filters are large, then you can bundle all of them into a
>>>> broadcast variable & use it to apply the transformation on your data with
>>>> a simple map operation: basically, you look up the right bloom filter for
>>>> each key & apply the filter to the record. Again, if deserializing the
>>>> bloom filter is time consuming, you can partition the data on key & then
>>>> use the broadcast variable to look up the bloom filter for each key &
>>>> apply the filter to all the data in a partition serially.
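>>>>
>>>> A rough sketch of both options in Scala (untested; records, loadFilter,
>>>> filtersByKey & the BloomFilter type are placeholders, not real APIs):
>>>>
>>>>   import org.apache.spark.HashPartitioner
>>>>
>>>>   // records: RDD[(String, Record)] keyed by group_id
>>>>
>>>>   // Option 1: partition by key, so each key's filter is looked up
>>>>   // at most once per partition rather than once per record
>>>>   val kept1 = records
>>>>     .partitionBy(new HashPartitioner(64))
>>>>     .mapPartitions { iter =>
>>>>       val cache = scala.collection.mutable.Map.empty[String, BloomFilter]
>>>>       iter.filter { case (key, rec) =>
>>>>         cache.getOrElseUpdate(key, loadFilter(key)).mightContain(rec)
>>>>       }
>>>>     }
>>>>
>>>>   // Option 2: broadcast a Map of all the filters & look one up per record
>>>>   val filtersBc = sc.broadcast(filtersByKey)   // Map[String, BloomFilter]
>>>>   val kept2 = records.filter { case (key, rec) =>
>>>>     filtersBc.value(key).mightContain(rec)
>>>>   }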
>>>>
>>>> Regards
>>>> Mayur
>>>>
>>>> Mayur Rustagi
>>>> Ph: +1 (760) 203 3257
>>>> http://www.sigmoidanalytics.com
>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>
>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 1:55 PM, Surendranauth Hiraman <
>>>> [email protected]> wrote:
>>>>
>>>>> We ended up going with:
>>>>>
>>>>> map() - set the group_id as the key in a Tuple
>>>>> reduceByKey() - end up with (K,Seq[V])
>>>>> map() - create the bloom filter and loop through the Seq and persist
>>>>> the Bloom filter
>>>>>
>>>>> This seems to be fine.
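>>>>>
>>>>> Roughly, that pipeline looks like the following in Scala (a simplified
>>>>> sketch, not our actual code; records, BloomFilter and persistFilter are
>>>>> stand-ins):
>>>>>
>>>>>   val keyed   = records.map(r => (r.groupId, Seq(r)))  // group_id as the key
>>>>>   val grouped = keyed.reduceByKey(_ ++ _)              // (K, Seq[V]) per group
>>>>>   val done = grouped.map { case (groupId, values) =>
>>>>>     val filter = new BloomFilter()                     // one filter per group
>>>>>     values.foreach(v => filter.add(v))                 // loop through the Seq
>>>>>     persistFilter(groupId, filter)                     // persist the Bloom filter
>>>>>     groupId
>>>>>   }
>>>>>   done.count()                                         // force evaluation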
>>>>>
>>>>> I guess Spark cannot optimize the reduceByKey and map steps to occur
>>>>> together since the fact that we are looping through the Seq is out of
>>>>> Spark's control.
>>>>>
>>>>> -Suren
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> My team is trying to replicate an existing Map/Reduce process in
>>>>>> Spark.
>>>>>>
>>>>>> Basically, we are creating Bloom Filters for quick set membership
>>>>>> tests within our processing pipeline.
>>>>>>
>>>>>> We have a single column (call it group_id) that we use to partition
>>>>>> into sets.
>>>>>>
>>>>>> As you would expect, in the map phase, we emit the group_id as the
>>>>>> key and in the reduce phase, we instantiate the Bloom Filter for a given
>>>>>> key in the setup() method and persist that Bloom Filter in the cleanup()
>>>>>> method.
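>>>>>>
>>>>>> Sketched in Scala for brevity, the reducer looks roughly like this
>>>>>> (simplified; BloomFilter and persistFilter are stand-ins for our real
>>>>>> classes):
>>>>>>
>>>>>>   import org.apache.hadoop.io.Text
>>>>>>   import org.apache.hadoop.mapreduce.Reducer
>>>>>>   import scala.collection.JavaConverters._
>>>>>>
>>>>>>   class GroupBloomReducer extends Reducer[Text, Text, Text, Text] {
>>>>>>     type Ctx = Reducer[Text, Text, Text, Text]#Context
>>>>>>     private var filter: BloomFilter = _
>>>>>>
>>>>>>     override def setup(ctx: Ctx): Unit =
>>>>>>       filter = new BloomFilter()           // instantiate before any reduce() call
>>>>>>
>>>>>>     override def reduce(key: Text, values: java.lang.Iterable[Text],
>>>>>>                         ctx: Ctx): Unit =
>>>>>>       values.asScala.foreach(v => filter.add(v.toString))  // add each member
>>>>>>
>>>>>>     override def cleanup(ctx: Ctx): Unit =
>>>>>>       persistFilter(filter)                // persist the finished filter
>>>>>>   }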
>>>>>>
>>>>>> In Spark, we can do something similar with map() and reduceByKey()
>>>>>> but we have the following questions.
>>>>>>
>>>>>>
>>>>>> 1. Accessing the reduce key
>>>>>> In reduceByKey(), how do we get access to the specific key within the
>>>>>> reduce function?
>>>>>>
>>>>>>
>>>>>> 2. Equivalent of setup/cleanup
>>>>>> Where should we instantiate and persist each Bloom Filter by key? In
>>>>>> the driver and then pass in the references to the reduce function? But if
>>>>>> so, how does the reduce function know which set's Bloom Filter it should 
>>>>>> be
>>>>>> writing to (question 1 above)?
>>>>>>
>>>>>> It seems that if we use groupByKey and then reduceByKey, that gives us
>>>>>> access to all of the values in one go. I assume that in that case, Spark
>>>>>> will manage things if those values don't all fit in memory at once.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: [email protected]
W: www.velos.io
