A few options include:

https://github.com/marufaytekin/lsh-spark - I've used this one a bit, and from what I've seen it scales quite well.
https://github.com/soundcloud/cosine-lsh-join-spark - I haven't used this one, but it looks like it should do exactly what you need.
https://github.com/mrsqueeze/spark-hash
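If you'd rather avoid an external dependency, Spark's own RowMatrix.columnSimilarities (the DIMSUM algorithm) may also be worth a look - passing a threshold lets it sample and skip pairs that are unlikely to be similar. A rough, untested sketch (it assumes tfidf is an RDD of (docId, mllib Vector) pairs - a placeholder name here - and since columnSimilarities compares columns, the matrix is flipped so each document becomes a column):

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// tfidf: RDD[(Long, org.apache.spark.mllib.linalg.Vector)] - placeholder;
// adapt to however your pipeline output is keyed
val rows = tfidf.map { case (docId, vec) => IndexedRow(docId, vec) }

// Flip to a (numTokens x numDocs) matrix so each document is a column.
val docsAsColumns = new IndexedRowMatrix(rows)
  .toCoordinateMatrix()
  .transpose()
  .toRowMatrix()

// DIMSUM: the threshold lets it sample away pairs whose cosine similarity
// is unlikely to reach it, avoiding a full all-pairs comparison.
val similarities = docsAsColumns.columnSimilarities(0.5)

// Entries are MatrixEntry(docId_i, docId_j, cosine) with i < j.
similarities.entries.take(10).foreach(println)

(If your vectors come out of the ml Pipeline API you'd first need to convert them to mllib vectors, e.g. via org.apache.spark.mllib.linalg.Vectors.fromML.)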
On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <kevin.r.mell...@gmail.com> wrote:

> Thanks for the reply, Nick! I'm typically analyzing around 30-50K products
> at a time (as an isolated set of products). Within this set of products
> (which represents all products for a particular supplier), I am also
> analyzing each category separately. The largest categories typically have
> around 10K products.
>
> That being said, when calculating IDFs for the 10K product set we come out
> with roughly 12K unique tokens. In other words, our vectors are 12K columns
> wide (although they are being represented using SparseVectors). We have a
> step that attempts to locate all documents that share the same tokens, and
> for those items we will calculate the cosine similarity. However, the part
> that attempts to identify documents with shared tokens is the bottleneck.
>
> For this portion, we map our data down to the individual tokens contained
> by each document. For example:
>
> DocumentId | Description
> -----------|---------------------
> 1          | Easton Hockey Stick
> 2          | Bauer Hockey Gloves
>
> In this case, we'd map to the following:
>
> (1, 'Easton')
> (1, 'Hockey')
> (1, 'Stick')
> (2, 'Bauer')
> (2, 'Hockey')
> (2, 'Gloves')
>
> Our goal is to aggregate this data as follows; however, our code that
> currently does this does not perform well. In the realistic 12K product
> scenario, this resulted in 430K document/token tuples.
>
> ((1, 2), ['Hockey'])
>
> This then tells us that documents 1 and 2 need to be compared to one
> another (via cosine similarity) because they both contain the token
> 'Hockey'. I will investigate the methods that you recommended to see if
> they may resolve our problem.
>
> Thanks,
> Kevin
>
> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> How many products do you have? How large are your vectors?
>>
>> It could be that SVD / LSA could be helpful. But if you have many
>> products then trying to compute all-pairs similarity with brute force is
>> not going to be scalable. In this case you may want to investigate
>> hashing (LSH) techniques.
>>
>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to write a Spark application that will detect similar items
>>> (in this case products) based on their descriptions. I've got an ML
>>> pipeline that transforms the product data to a TF-IDF representation,
>>> using the following components.
>>>
>>> - *RegexTokenizer* - strips out non-word characters, resulting in a
>>>   list of tokens
>>> - *StopWordsRemover* - removes common "stop words", such as "the",
>>>   "and", etc.
>>> - *HashingTF* - assigns a numeric "hash" to each token and calculates
>>>   the term frequency
>>> - *IDF* - computes the inverse document frequency
>>>
>>> After this pipeline evaluates, I'm left with a SparseVector that
>>> represents the TF-IDF weights of the tokens for each product. As a next
>>> step, I'd like to compare the vectors to one another, to detect
>>> similarities.
>>>
>>> Does anybody know of a straightforward way to do this in Spark? I tried
>>> creating a UDF (that used the Breeze linear algebra methods internally);
>>> however, that did not scale well.
>>>
>>> Thanks,
>>> Kevin
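For reference, the candidate-pair aggregation described in the quoted message can be written as a self-join on the token key. A minimal, untested sketch with placeholder names - though, as the thread notes, a common token like 'Hockey' makes this join grow quadratically, which is exactly the bottleneck the hashing approaches are meant to avoid:

// docTokens: RDD[(Long, String)] of (documentId, token), as in the example above
val byToken = docTokens.map { case (docId, token) => (token, docId) }

val candidatePairs = byToken
  .join(byToken)                              // (token, (docId1, docId2))
  .filter { case (_, (a, b)) => a < b }       // keep each unordered pair of distinct docs once
  .map { case (token, pair) => (pair, token) }
  .groupByKey()                               // ((docId1, docId2), tokens they share)

// e.g. ((1, 2), Iterable(Hockey)) for the two example documents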