You are using the data grouped (sorted?) to create the bloom filter?

On Mar 20, 2014 4:35 PM, "Surendranauth Hiraman" <[email protected]> wrote:
> Mayur,
>
> To be a little clearer, for creating the Bloom Filters, I don't think
> broadcast variables are the way to go, though they would definitely work
> for using the Bloom Filters to filter data.
>
> The reason is that the creation needs to happen in a single thread.
> Otherwise, some type of locking/distributed locking is needed on the
> individual Bloom Filter itself, with a performance impact.
>
> Agreed?
>
> -Suren
>
>
> On Thu, Mar 20, 2014 at 3:40 PM, Surendranauth Hiraman <
> [email protected]> wrote:
>
>> Mayur,
>>
>> Thanks. This step is actually for creating the Bloom Filter, not using
>> it to filter data. But your answer still stands.
>>
>> Partitioning by key, having the bloom filters as a broadcast variable,
>> and then doing mapPartitions makes sense.
>>
>> Are there performance implications for this approach, such as with using
>> the broadcast variable, versus the approach we used, in which the Bloom
>> Filter (again, for creating it) is only referenced by the single map
>> application?
>>
>> -Suren
>>
>>
>> On Thu, Mar 20, 2014 at 3:20 PM, Mayur Rustagi
>> <[email protected]> wrote:
>>
>>> Why are you trying to reduceByKey? Are you looking to work on the data
>>> sequentially?
>>> If I understand correctly, you are looking to filter your data using
>>> the bloom filter, and each bloom filter is tied to the key that
>>> instantiated it. Here are some of the options:
>>> *Partition* your data by key and use the mapPartitions operator to run
>>> a function on each partition independently. The same function will be
>>> applied to each partition.
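As a minimal sketch of that first option in Python/PySpark terms: the function below is meant to be passed to `rdd.partitionBy(n).mapPartitions(...)` on a pair RDD, and the `BloomFilter` class is a toy hash-based illustration rather than a specific library.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (bit array + k hashed positions); an illustration
    only, not a specific library implementation."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive k bit positions from k salted hashes of the value.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(value))


def build_filters_per_partition(partition):
    """Intended usage: rdd.partitionBy(n).mapPartitions(build_filters_per_partition).
    Because the data is partitioned by key, every value for a given key
    arrives in a single partition, so each filter is built by exactly one
    task -- no locking needed."""
    filters = {}
    for key, value in partition:
        filters.setdefault(key, BloomFilter()).add(value)
    return iter(filters.items())  # (key, filter) pairs, one per key
```

Since each key's records land in one partition, each filter is built single-threaded, which addresses the locking concern raised above.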
>>> If your bloom filter is large, then you can bundle all of them in a
>>> broadcast variable and use it to apply the transformation on your data
>>> with a simple map operation: you look up the right bloom filter for each
>>> key and apply the filter with it. Here again, if deserializing the bloom
>>> filter is time consuming, you can partition the data by key and then use
>>> the broadcast variable to look up the bloom filter for each key and
>>> apply the filter to all of that key's data serially.
>>>
>>> Regards
>>> Mayur
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>> On Thu, Mar 20, 2014 at 1:55 PM, Surendranauth Hiraman <
>>> [email protected]> wrote:
>>>
>>>> We ended up going with:
>>>>
>>>> map() - set the group_id as the key in a Tuple
>>>> reduceByKey() - end up with (K, Seq[V])
>>>> map() - create the bloom filter, loop through the Seq, and persist the
>>>> Bloom filter
>>>>
>>>> This seems to be fine.
>>>>
>>>> I guess Spark cannot optimize the reduceByKey and map steps to occur
>>>> together, since the fact that we are looping through the Seq is out of
>>>> Spark's control.
>>>>
>>>> -Suren
>>>>
>>>>
>>>> On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My team is trying to replicate an existing Map/Reduce process in
>>>>> Spark.
>>>>>
>>>>> Basically, we are creating Bloom Filters for quick set-membership
>>>>> tests within our processing pipeline.
>>>>>
>>>>> We have a single column (call it group_id) that we use to partition
>>>>> the data into sets.
>>>>>
>>>>> As you would expect, in the map phase we emit the group_id as the key,
>>>>> and in the reduce phase we instantiate the Bloom Filter for a given
>>>>> key in the setup() method and persist that Bloom Filter in the
>>>>> cleanup() method.
>>>>>
>>>>> In Spark, we can do something similar with map() and reduceByKey(),
>>>>> but we have the following questions.
>>>>>
>>>>> 1. Accessing the reduce key
>>>>> In reduceByKey(), how do we get access to the specific key within the
>>>>> reduce function?
>>>>>
>>>>> 2. Equivalent of setup/cleanup
>>>>> Where should we instantiate and persist each Bloom Filter by key? In
>>>>> the driver, and then pass in the references to the reduce function?
>>>>> But if so, how does the reduce function know which set's Bloom Filter
>>>>> it should be writing to (question 1 above)?
>>>>>
>>>>> It seems that if we use groupByKey and then reduceByKey, that gives us
>>>>> access to all of the values in one go. I assume that there, Spark will
>>>>> manage things if those values don't all fit in memory at once.
>>>>>
>>>>>
>>>>> SUREN HIRAMAN, VP TECHNOLOGY
>>>>> Velos
>>>>> Accelerating Machine Learning
>>>>>
>>>>> 440 NINTH AVENUE, 11TH FLOOR
>>>>> NEW YORK, NY 10001
>>>>> O: (917) 525-2466 ext. 105
>>>>> F: 646.349.4063
>>>>> E: [email protected]
>>>>> W: www.velos.io
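For reference, the pipeline Suren describes upthread (map to emit group_id as the key, gather each key's values, then build and persist one Bloom filter per key) can be sketched as a pure-Python stand-in for the RDD steps; sets play the role of Bloom filters so the sketch stays self-contained, and `persist_filter` is a hypothetical persistence hook.

```python
from collections import defaultdict


def build_filters(records, persist_filter=lambda key, f: None):
    """Pure-Python stand-in for:
        rdd.map(lambda r: (r.group_id, r.value))  # emit group_id as key
           .groupByKey()                          # -> (key, all values for key)
           .map(build_and_persist_one_filter)     # one filter per key
    groupByKey (rather than reduceByKey) hands the downstream function the
    key together with all of its values, which is what question 1 asks for.
    """
    grouped = defaultdict(set)  # a set stands in for a real Bloom filter
    for group_id, value in records:
        grouped[group_id].add(value)   # "add value to this key's filter"
    for key, f in grouped.items():
        persist_filter(key, f)         # cleanup()-style persistence hook
    return dict(grouped)
```

Because the per-key loop happens inside a single downstream map over the grouped pairs, each filter is touched by only one thread, consistent with the single-threaded-creation point made earlier in the thread.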
