I should also point out another library I had not come across before:
https://github.com/sethah/spark-neighbors

On Tue, 20 Sep 2016 at 21:03 Kevin Mellott <kevin.r.mell...@gmail.com>
wrote:

> Using the Soundcloud implementation of LSH, I was able to process a 22K
> product dataset in a mere 65 seconds! Thanks so much for the help!
>
> On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com>
> wrote:
>
>> Thanks Nick - those examples will help a ton!!
>>
>> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> A few options include:
>>>
>>> https://github.com/marufaytekin/lsh-spark - I've used this a bit, and from
>>> what I've looked at it seems quite scalable.
>>> https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
>>> looks like it should do exactly what you need.
>>> https://github.com/mrsqueeze/spark-hash
>>>
>>>
>>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <kevin.r.mell...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>>> products at a time (as an isolated set of products). Within this set of
>>>> products (which represents all products for a particular supplier), I am
>>>> also analyzing each category separately. The largest categories typically
>>>> have around 10K products.
>>>>
>>>> That being said, when calculating IDFs for the 10K product set we come
>>>> out with roughly 12K unique tokens. In other words, our vectors are 12K
>>>> columns wide (although they are being represented using SparseVectors). We
>>>> have a step that is attempting to locate all documents that share the same
>>>> tokens, and for those items we will calculate the cosine similarity.
>>>> However, the part that attempts to identify documents with shared tokens is
>>>> the bottleneck.
>>>>
>>>> For this portion, we map our data down to the individual tokens
>>>> contained by each document. For example:
>>>>
>>>> DocumentId   |   Description
>>>> -------------|---------------------------
>>>> 1            |   Easton Hockey Stick
>>>> 2            |   Bauer Hockey Gloves
>>>>
>>>> In this case, we'd map to the following:
>>>>
>>>> (1, 'Easton')
>>>> (1, 'Hockey')
>>>> (1, 'Stick')
>>>> (2, 'Bauer')
>>>> (2, 'Hockey')
>>>> (2, 'Gloves')
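>>>>
>>>> In code, that mapping step is roughly the following (a simplified sketch;
>>>> sc is the SparkContext, and the hard-coded rows simply mirror the example
>>>> above - in the real job the token lists come from our tokenizer output):
>>>>
>>>> // Each input record is (documentId, tokens for that document).
>>>> val docs = sc.parallelize(Seq(
>>>>   (1, Seq("Easton", "Hockey", "Stick")),
>>>>   (2, Seq("Bauer", "Hockey", "Gloves"))
>>>> ))
>>>>
>>>> // Emit one (documentId, token) pair per token in each document.
>>>> val docTokens = docs.flatMap { case (docId, tokens) =>
>>>>   tokens.map(token => (docId, token))
>>>> }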
>>>>
>>>> Our goal is to aggregate this data as follows; however, our code that
>>>> currently does this does not perform well. In the realistic 12K product
>>>> scenario, this resulted in 430K document/token tuples.
>>>>
>>>> ((1, 2), ['Hockey'])
>>>>
>>>> This then tells us that documents 1 and 2 need to be compared to one
>>>> another (via cosine similarity) because they both contain the token
>>>> 'hockey'. I will investigate the methods that you recommended to see if
>>>> they may resolve our problem.
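>>>>
>>>> For reference, the pair-generation step we have today is roughly the
>>>> following (a simplified sketch of our current approach, building on the
>>>> docTokens pairs above); the per-token pairwise expansion is where the
>>>> time goes:
>>>>
>>>> // Invert to (token, documentId), group by token, and emit every
>>>> // unordered pair of documents that share the token. A token shared by
>>>> // n documents produces n*(n-1)/2 pairs, which is the bottleneck.
>>>> val candidatePairs = docTokens
>>>>   .map { case (docId, token) => (token, docId) }
>>>>   .groupByKey()
>>>>   .flatMap { case (token, docIds) =>
>>>>     val ids = docIds.toSeq.distinct.sorted
>>>>     for {
>>>>       i <- ids
>>>>       j <- ids if i < j
>>>>     } yield ((i, j), token)
>>>>   }
>>>>   .groupByKey()   // ((docId1, docId2), tokens shared by that pair)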
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> How many products do you have? How large are your vectors?
>>>>>
>>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>>> products then trying to compute all-pair similarity with brute force is
>>>>> not going to be scalable. In this case you may want to investigate
>>>>> hashing (LSH) techniques.
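>>>>>
>>>>> To give a flavour of the idea (an illustrative sketch only, not any
>>>>> particular library's API): for cosine similarity you can hash each vector
>>>>> against a handful of random hyperplanes and only compare documents whose
>>>>> bit signatures collide.
>>>>>
>>>>> import scala.util.Random
>>>>>
>>>>> val dim = 12000       // vector dimensionality (vocabulary size)
>>>>> val numPlanes = 16    // signature bits; more planes = fewer candidates
>>>>> val rng = new Random(42)
>>>>> val planes = Array.fill(numPlanes, dim)(rng.nextGaussian())
>>>>>
>>>>> // The sign of the dot product with each random hyperplane gives one bit;
>>>>> // vectors pointing in similar directions tend to share signatures.
>>>>> def signature(indices: Array[Int], values: Array[Double]): Int =
>>>>>   planes.zipWithIndex.foldLeft(0) { case (sig, (plane, bit)) =>
>>>>>     val dot = indices.zip(values).map { case (i, v) => plane(i) * v }.sum
>>>>>     if (dot > 0) sig | (1 << bit) else sig
>>>>>   }
>>>>>
>>>>> // Group documents by signature(sv.indices, sv.values) and compute exact
>>>>> // cosine similarity only within each group of colliding documents.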
>>>>>
>>>>>
>>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm trying to write a Spark application that will detect similar
>>>>>> items (in this case products) based on their descriptions. I've got an ML
>>>>>> pipeline that transforms the product data to a TF-IDF representation, using
>>>>>> the following components (a rough sketch of the pipeline follows the list).
>>>>>>
>>>>>>    - *RegexTokenizer* - strips out non-word characters, results in a
>>>>>>    list of tokens
>>>>>>    - *StopWordsRemover* - removes common "stop words", such as
>>>>>>    "the", "and", etc.
>>>>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>>>>    calculates the term frequency
>>>>>>    - *IDF* - computes the inverse document frequency
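>>>>>>
>>>>>> The pipeline is roughly the following (a simplified sketch; the column
>>>>>> names are just the ones we happen to use, and products is the input
>>>>>> DataFrame with a "description" column):
>>>>>>
>>>>>> import org.apache.spark.ml.Pipeline
>>>>>> import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover,
>>>>>>   HashingTF, IDF}
>>>>>>
>>>>>> val tokenizer = new RegexTokenizer()
>>>>>>   .setInputCol("description").setOutputCol("tokens")
>>>>>>   .setPattern("\\W+")            // split on runs of non-word characters
>>>>>> val remover = new StopWordsRemover()
>>>>>>   .setInputCol("tokens").setOutputCol("filtered")
>>>>>> val hashingTF = new HashingTF()
>>>>>>   .setInputCol("filtered").setOutputCol("tf")
>>>>>> val idf = new IDF()
>>>>>>   .setInputCol("tf").setOutputCol("tfidf")
>>>>>>
>>>>>> val pipeline = new Pipeline()
>>>>>>   .setStages(Array(tokenizer, remover, hashingTF, idf))
>>>>>> val features = pipeline.fit(products).transform(products)
>>>>>> // the "tfidf" column now holds a SparseVector per product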
>>>>>>
>>>>>> After this pipeline evaluates, I'm left with a SparseVector that
>>>>>> represents the TF-IDF weights of the tokens for each product. As a next
>>>>>> step, I'd like to be able to compare the vectors to one another, to
>>>>>> detect similarities.
>>>>>>
>>>>>> Does anybody know of a straightforward way to do this in Spark? I
>>>>>> tried creating a UDF (that used the Breeze linear algebra methods
>>>>>> internally); however, that did not scale well.
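>>>>>>
>>>>>> For context, the UDF I tried was along these lines (a simplified sketch
>>>>>> without the Breeze internals; the real version applied this over a cross
>>>>>> join of all products, which is the part that did not scale):
>>>>>>
>>>>>> import org.apache.spark.ml.linalg.Vector
>>>>>> import org.apache.spark.sql.functions.udf
>>>>>>
>>>>>> // Cosine similarity between two TF-IDF vectors.
>>>>>> val cosineSim = udf { (a: Vector, b: Vector) =>
>>>>>>   val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
>>>>>>   val normA = math.sqrt(a.toArray.map(x => x * x).sum)
>>>>>>   val normB = math.sqrt(b.toArray.map(x => x * x).sum)
>>>>>>   if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
>>>>>> }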
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin
>>>>>>
>>>>>
>>>>
>>
>
