I should also point out another library I had not come across before: https://github.com/sethah/spark-neighbors

On Tue, 20 Sep 2016 at 21:03 Kevin Mellott <kevin.r.mell...@gmail.com> wrote:

> Using the SoundCloud implementation of LSH, I was able to process a 22K
> product dataset in a mere 65 seconds! Thanks so much for the help!
>
> On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com>
> wrote:
>
>> Thanks, Nick - those examples will help a ton!!
>>
>> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath
>> <nick.pentre...@gmail.com> wrote:
>>
>>> A few options include:
>>>
>>> https://github.com/marufaytekin/lsh-spark - I've used this a bit, and
>>> from what I've looked at it seems quite scalable.
>>> https://github.com/soundcloud/cosine-lsh-join-spark - I haven't used
>>> this one, but it looks like it should do exactly what you need.
>>> https://github.com/mrsqueeze/spark-hash
>>>
>>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <kevin.r.mell...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>>> products at a time (as an isolated set of products). Within this set
>>>> of products (which represents all products for a particular
>>>> supplier), I am also analyzing each category separately. The largest
>>>> categories typically have around 10K products.
>>>>
>>>> That being said, when calculating IDFs for the 10K product set we
>>>> come out with roughly 12K unique tokens. In other words, our vectors
>>>> are 12K columns wide (although they are represented using
>>>> SparseVectors). We have a step that attempts to locate all documents
>>>> sharing the same tokens; for those items we will calculate the
>>>> cosine similarity. However, the part that identifies documents with
>>>> shared tokens is the bottleneck.
>>>>
>>>> For this portion, we map our data down to the individual tokens
>>>> contained by each document. For example:
>>>>
>>>> DocumentId | Description
>>>> -----------+---------------------
>>>> 1          | Easton Hockey Stick
>>>> 2          | Bauer Hockey Gloves
>>>>
>>>> In this case, we'd map to the following:
>>>>
>>>> (1, 'Easton')
>>>> (1, 'Hockey')
>>>> (1, 'Stick')
>>>> (2, 'Bauer')
>>>> (2, 'Hockey')
>>>> (2, 'Gloves')
>>>>
>>>> Our goal is to aggregate this data as follows; however, our code
>>>> that currently does this does not perform well. In the realistic 12K
>>>> product scenario, this step resulted in 430K document/token tuples.
>>>>
>>>> ((1, 2), ['Hockey'])
>>>>
>>>> This then tells us that documents 1 and 2 need to be compared to one
>>>> another (via cosine similarity) because they both contain the token
>>>> 'Hockey'. I will investigate the methods that you recommended to see
>>>> if they may resolve our problem.
>>>>
>>>> Thanks,
>>>> Kevin
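
For reference, the pairing step described above can be sketched with a
couple of RDD transformations. This is only a minimal illustration under
assumed inputs (an RDD of (documentId, token) pairs; all names are made
up for the example), not the actual code from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object CandidatePairs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("candidate-pairs").setMaster("local[*]"))

        // (documentId, token) pairs, as in the example above.
        val docTokens = sc.parallelize(Seq(
          (1, "Easton"), (1, "Hockey"), (1, "Stick"),
          (2, "Bauer"), (2, "Hockey"), (2, "Gloves")))

        // Invert to (token, documentId), group by token, and emit every
        // pair of documents that shares the token.
        val sharedTokens = docTokens
          .map { case (doc, token) => (token, doc) }
          .groupByKey()
          .flatMap { case (token, docs) =>
            val ds = docs.toSeq.distinct.sorted
            for {
              i <- ds.indices
              j <- (i + 1) until ds.size
            } yield ((ds(i), ds(j)), token)
          }
          .groupByKey() // ((doc1, doc2), tokens they share)

        sharedTokens.collect().foreach(println) // ((1,2),CompactBuffer(Hockey))
        sc.stop()
      }
    }

Note that a token shared by n documents expands to n * (n - 1) / 2 pairs,
so any very common token blows the shuffle up quadratically. That is why
this step becomes the bottleneck, and why the LSH libraries listed above
instead only pair documents whose hash signatures collide.
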
>>>>
>>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath
>>>> <nick.pentre...@gmail.com> wrote:
>>>>
>>>>> How many products do you have? How large are your vectors?
>>>>>
>>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>>> products, then trying to compute all-pairs similarity by brute
>>>>> force is not going to be scalable. In this case you may want to
>>>>> investigate hashing (LSH) techniques.
>>>>>
>>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott
>>>>> <kevin.r.mell...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm trying to write a Spark application that will detect similar
>>>>>> items (in this case products) based on their descriptions. I've
>>>>>> got an ML pipeline that transforms the product data to a TF-IDF
>>>>>> representation, using the following components.
>>>>>>
>>>>>> - *RegexTokenizer* - strips out non-word characters, resulting in
>>>>>>   a list of tokens
>>>>>> - *StopWordsRemover* - removes common "stop words", such as "the",
>>>>>>   "and", etc.
>>>>>> - *HashingTF* - assigns a numeric "hash" to each token and
>>>>>>   calculates the term frequency
>>>>>> - *IDF* - computes the inverse document frequency
>>>>>>
>>>>>> After this pipeline evaluates, I'm left with a SparseVector of
>>>>>> TF-IDF weights for each product. As a next step, I'd like to
>>>>>> compare each vector to every other, to detect similarities.
>>>>>>
>>>>>> Does anybody know of a straightforward way to do this in Spark? I
>>>>>> tried creating a UDF (that used the Breeze linear algebra methods
>>>>>> internally); however, that did not scale well.
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin
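
A minimal sketch of the pipeline described above, built from the standard
Spark ML stages (the column names, regex pattern, and feature count here
are illustrative choices, not taken from the thread):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer, StopWordsRemover}
    import org.apache.spark.sql.SparkSession

    object TfIdfPipeline {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("tfidf-pipeline").master("local[*]").getOrCreate()
        import spark.implicits._

        // Toy stand-in for the product data.
        val products = Seq(
          (1, "Easton Hockey Stick"),
          (2, "Bauer Hockey Gloves")).toDF("id", "description")

        // Strip non-word characters and tokenize.
        val tokenizer = new RegexTokenizer()
          .setInputCol("description").setOutputCol("tokens")
          .setPattern("\\W+") // runs of non-word characters act as delimiters

        // Remove common stop words ("the", "and", ...).
        val remover = new StopWordsRemover()
          .setInputCol("tokens").setOutputCol("filtered")

        // Hash each token into a fixed-size space and count term frequency.
        val tf = new HashingTF()
          .setInputCol("filtered").setOutputCol("rawFeatures")
          .setNumFeatures(1 << 14)

        // Weight the term frequencies by inverse document frequency.
        val idf = new IDF()
          .setInputCol("rawFeatures").setOutputCol("features")

        val model = new Pipeline()
          .setStages(Array(tokenizer, remover, tf, idf))
          .fit(products)

        // "features" holds a SparseVector of TF-IDF weights per product.
        model.transform(products).select("id", "features").show(false)
        spark.stop()
      }
    }

Since the cosine similarity of two L2-normalized vectors is just their dot
product, running these features through a Normalizer (p = 2) before any
pairwise comparison keeps the similarity computation itself cheap; the LSH
libraries above approximate the same comparison at scale.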