A few options include:

https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
seems quite scalable from what I've seen.
https://github.com/soundcloud/cosine-lsh-join-spark - I haven't used this
one, but it looks like it should do exactly what you need.
https://github.com/mrsqueeze/spark-hash
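
Whichever you pick, the workflow is roughly the same: fit an LSH model on
the token vectors, use an approximate similarity join to get candidate
pairs, and only compute exact cosine similarity on those. A rough sketch
below, shown against the MinHashLSH API that is being added to Spark ML
itself rather than the packages above (which have their own, broadly
similar APIs); the column names and the 0.8 threshold are just
illustrative:

    import org.apache.spark.ml.feature.MinHashLSH
    import org.apache.spark.sql.functions.col

    // "products" is a DataFrame with an "id" column and the TF-IDF
    // SparseVectors in a "features" column (names are illustrative).
    // MinHash approximates Jaccard distance over the non-zero entries,
    // i.e. the set of tokens each document contains.
    val mh = new MinHashLSH()
      .setNumHashTables(5)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(products)

    // Candidate pairs within a Jaccard distance of 0.8; only these need
    // the exact cosine-similarity check afterwards.
    val candidates = model
      .approxSimilarityJoin(products, products, 0.8, "jaccardDist")
      .filter(col("datasetA.id") < col("datasetB.id"))  // drop self/duplicate pairs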


On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <kevin.r.mell...@gmail.com>
wrote:

> Thanks for the reply, Nick! I'm typically analyzing around 30-50K products
> at a time (as an isolated set of products). Within this set of products
> (which represents all products for a particular supplier), I am also
> analyzing each category separately. The largest categories typically have
> around 10K products.
>
> That being said, when calculating IDFs for the 10K product set we come out
> with roughly 12K unique tokens. In other words, our vectors are 12K columns
> wide (although they are being represented using SparseVectors). We have a
> step that is attempting to locate all documents that share the same tokens,
> and for those items we will calculate the cosine similarity. However, the
> part that attempts to identify documents with shared tokens is the
> bottleneck.
>
> For this portion, we map our data down to the individual tokens contained
> by each document. For example:
>
> DocumentId | Description
> -----------|---------------------
> 1          | Easton Hockey Stick
> 2          | Bauer Hockey Gloves
>
> In this case, we'd map to the following:
>
> (1, 'Easton')
> (1, 'Hockey')
> (1, 'Stick')
> (2, 'Bauer')
> (2, 'Hockey')
> (2, 'Gloves')
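>
> That mapping is essentially just a flatMap over the tokenized documents.
> Roughly (names are illustrative, not our exact code):
>
>     // docTokens: RDD[(Long, Seq[String])] of (documentId, tokens)
>     val docTokenPairs = docTokens.flatMap { case (docId, tokens) =>
>       tokens.map(token => (docId, token))
>     }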
>
> Our goal is to aggregate this data as follows; however, our code that
> currently does this does not perform well. In the realistic 12K product
> scenario, this resulted in 430K document/token tuples.
>
> ((1, 2), ['Hockey'])
>
> This then tells us that documents 1 and 2 need to be compared to one
> another (via cosine similarity) because they both contain the token
> 'hockey'. I will investigate the methods that you recommended to see if
> they may resolve our problem.
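>
> In case it helps pinpoint the bottleneck, here's a simplified sketch of
> the kind of aggregation we're doing today (not our exact code):
>
>     // docTokenPairs: RDD[(Long, String)] of (documentId, token), as above
>     val sharedTokenPairs = docTokenPairs
>       .map { case (docId, token) => (token, docId) }
>       .groupByKey()                           // token -> all documents containing it
>       .flatMap { case (token, docIds) =>
>         val ids = docIds.toSeq.distinct.sorted
>         for {
>           i <- ids
>           j <- ids if i < j
>         } yield ((i, j), token)
>       }
>       .groupByKey()                           // (doc1, doc2) -> shared tokens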
>
> Thanks,
> Kevin
>
> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> How many products do you have? How large are your vectors?
>>
>> SVD / LSA could be helpful here. But if you have many
>> products then computing all-pairs similarity by brute force is not
>> going to be scalable. In this case you may want to investigate hashing
>> (LSH) techniques.
>>
>>
>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to write a Spark application that will detect similar items
>>> (in this case products) based on their descriptions. I've got an ML
>>> pipeline that transforms the product data to a TF-IDF representation, using
>>> the following components (a rough sketch of the pipeline follows the list).
>>>
>>>    - *RegexTokenizer* - strips out non-word characters, results in a
>>>    list of tokens
>>>    - *StopWordsRemover* - removes common "stop words", such as "the",
>>>    "and", etc.
>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>    calculates the term frequency
>>>    - *IDF* - computes the inverse document frequency
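>>>
>>> Roughly, the pipeline is assembled like this (column names are
>>> illustrative):
>>>
>>>     import org.apache.spark.ml.Pipeline
>>>     import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF, IDF}
>>>
>>>     val tokenizer = new RegexTokenizer()
>>>       .setInputCol("description").setOutputCol("tokens")
>>>       .setPattern("\\W")                  // split on non-word characters
>>>     val remover = new StopWordsRemover()
>>>       .setInputCol("tokens").setOutputCol("filtered")
>>>     val tf = new HashingTF()
>>>       .setInputCol("filtered").setOutputCol("rawFeatures")
>>>     val idf = new IDF()
>>>       .setInputCol("rawFeatures").setOutputCol("features")
>>>
>>>     val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf))
>>>     // "products" has the raw "description" column; "features" ends up
>>>     // holding the TF-IDF SparseVectors.
>>>     val featurized = pipeline.fit(products).transform(products)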
>>>
>>> After this pipeline runs, I'm left with a SparseVector of TF-IDF weights
>>> for the tokens in each product. As a next step, I'd like to compare these
>>> vectors to one another to detect similarities.
>>>
>>> Does anybody know of a straightforward way to do this in Spark? I tried
>>> creating a UDF (that used the Breeze linear algebra methods internally);
>>> however, that did not scale well.
>>>
>>> Thanks,
>>> Kevin
>>>
>>
>
