Grouped by the group_id but not sorted. -Suren
On Thu, Mar 20, 2014 at 5:52 PM, Mayur Rustagi <[email protected]> wrote:

You are using the data grouped (sorted?) to create the bloom filter?

On Mar 20, 2014 4:35 PM, "Surendranauth Hiraman" <[email protected]> wrote:

Mayur,

To be a little clearer, for creating the Bloom Filters, I don't think broadcast variables are the way to go, though they would definitely work for using the Bloom Filters to filter data.

The reason is that the creation needs to happen in a single thread. Otherwise, some type of locking/distributed locking is needed on the individual Bloom Filter itself, with a performance impact.

Agreed?

-Suren

On Thu, Mar 20, 2014 at 3:40 PM, Surendranauth Hiraman <[email protected]> wrote:

Mayur,

Thanks. This step is for creating the Bloom Filter, not using it to filter data, actually. But your answer still stands.

Partitioning by key, having the bloom filters as a broadcast variable and then doing mapPartitions makes sense.

Are there performance implications for this approach, such as with using the broadcast variable, versus the approach we used, in which the Bloom Filter (again, for creating it) is only referenced by the single map application?

-Suren

On Thu, Mar 20, 2014 at 3:20 PM, Mayur Rustagi <[email protected]> wrote:

Why are you trying to reduceByKey? Are you looking to work on the data sequentially?

If I understand correctly, you are looking to filter your data using the bloom filter, and each bloom filter is tied to the key that instantiated it. Following are some of the options:

*Partition* your data by key and use the mapPartitions operator to run a function on each partition independently. The same function will be applied to each partition.

If your bloom filter is large, you can bundle all of the filters as a broadcast variable and apply the transformation to your data with a simple map operation: for each record you look up the right bloom filter by key and apply the filter to it. If deserializing the bloom filters is time-consuming, you can partition the data by key and then use the broadcast variable to look up the bloom filter once per key and apply the filter to all of that partition's data serially.

Regards,
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Thu, Mar 20, 2014 at 1:55 PM, Surendranauth Hiraman <[email protected]> wrote:

We ended up going with:

map() - set the group_id as the key in a Tuple
reduceByKey() - end up with (K, Seq[V])
map() - create the Bloom Filter, loop through the Seq, and persist the Bloom Filter

This seems to be fine.

I guess Spark cannot optimize the reduceByKey and map steps to occur together, since the fact that we are looping through the Seq is out of Spark's control.

-Suren

On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman <[email protected]> wrote:

Hi,

My team is trying to replicate an existing Map/Reduce process in Spark.

Basically, we are creating Bloom Filters for quick set membership tests within our processing pipeline.

We have a single column (call it group_id) that we use to partition into sets.

As you would expect, in the map phase we emit the group_id as the key, and in the reduce phase we instantiate the Bloom Filter for a given key in the setup() method and persist that Bloom Filter in the cleanup() method.

In Spark, we can do something similar with map() and reduceByKey(), but we have the following questions.

1. Accessing the reduce key
In reduceByKey(), how do we get access to the specific key within the reduce function?

2. Equivalent of setup/cleanup
Where should we instantiate and persist each Bloom Filter by key? In the driver, and then pass in the references to the reduce function? But if so, how does the reduce function know which set's Bloom Filter it should be writing to (question 1 above)?

It seems that if we use groupByKey and then reduceByKey, that gives us access to all of the values in one go. I assume there, Spark will manage it if those values don't all fit in memory at once.

--

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: [email protected]
W: www.velos.io
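[Editor's note] A minimal sketch (in Scala, against the Spark API of that era) of the map -> reduceByKey -> map pipeline Suren describes above. All names here (Record, BloomFilter, persistFilter, BloomFilterBuild) are hypothetical placeholders for the real record type, filter implementation, and storage sink, not actual Velos code.

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey (pre-1.3 Spark)
import org.apache.spark.rdd.RDD

object BloomFilterBuild {

  // Hypothetical record type; stands in for whatever the real rows look like.
  case class Record(groupId: String, value: String)

  // Hypothetical Bloom filter exposing only what this sketch needs.
  class BloomFilter extends Serializable {
    def add(item: String): Unit = { /* set the bits for item */ }
    def mightContain(item: String): Boolean = true // test the bits for item
  }

  // Hypothetical sink: write a finished filter somewhere durable (HDFS, a key-value store, ...).
  def persistFilter(groupId: String, bf: BloomFilter): Unit = ()

  def buildFilters(records: RDD[Record]): Unit = {
    records
      .map(r => (r.groupId, Seq(r.value)))  // map: key each record by group_id
      .reduceByKey(_ ++ _)                  // reduce: gather all of a group's values into one Seq
      .foreach { case (groupId, values) =>  // the thread's final map(); foreach also forces evaluation
        val bf = new BloomFilter
        values.foreach(bf.add)
        persistFilter(groupId, bf)
      }
  }
}
```

Concatenating Seqs inside reduceByKey mirrors the pipeline as described in the thread; groupByKey, which the original question also mentions, would produce the same (key, all values) shape more directly for the same shuffle.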

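[Editor's note] And a sketch, under the same assumptions, of the broadcast-variable option Mayur suggests for the filtering side: broadcast the per-group filters once and look up the right one by key inside an ordinary filter. It reuses the hypothetical Record and BloomFilter types from the sketch above; filterWithBloom is likewise an invented name.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BloomFilterApply {
  // Reuses the hypothetical Record and BloomFilter types from the previous sketch.
  import BloomFilterBuild.{Record, BloomFilter}

  def filterWithBloom(sc: SparkContext,
                      records: RDD[Record],
                      filtersByGroup: Map[String, BloomFilter]): RDD[Record] = {
    // Each executor gets one read-only copy of all the filters,
    // instead of shipping them inside every task closure.
    val bcFilters = sc.broadcast(filtersByGroup)
    records.filter { r =>
      bcFilters.value.get(r.groupId) match {
        case Some(bf) => bf.mightContain(r.value)  // keep records the filter may contain
        case None     => false                     // no filter for this group: drop it (an assumption)
      }
    }
  }
}
```

If the per-record lookup or filter deserialization proved costly, the other option in Mayur's reply would apply: partition the data by key and resolve the filter once per partition inside mapPartitions rather than once per record.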