Sorry, the original repo is: https://github.com/karlhigley/spark-neighbors

On Wed, 21 Sep 2016 at 13:09 Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> I should also point out another library I had not come across before:
> https://github.com/sethah/spark-neighbors
>
>
> On Tue, 20 Sep 2016 at 21:03 Kevin Mellott <kevin.r.mell...@gmail.com>
> wrote:
>
>> Using the Soundcloud implementation of LSH, I was able to process a 22K
>> product dataset in a mere 65 seconds! Thanks so much for the help!
>>
>> On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com
>> > wrote:
>>
>>> Thanks Nick - those examples will help a ton!!
>>>
>>> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
>>>> A few options include:
>>>>
>>>> https://github.com/marufaytekin/lsh-spark - I've used this a bit, and
>>>> from what I've seen it seems quite scalable.
>>>> https://github.com/soundcloud/cosine-lsh-join-spark - I haven't used
>>>> this, but it looks like it should do exactly what you need.
>>>> https://github.com/mrsqueeze/spark-hash
>>>>
>>>>
>>>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <kevin.r.mell...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>>>> products at a time (as an isolated set of products). Within this set of
>>>>> products (which represents all products for a particular supplier), I am
>>>>> also analyzing each category separately. The largest categories typically
>>>>> have around 10K products.
>>>>>
>>>>> That being said, when calculating IDFs for the 10K product set we end up
>>>>> with roughly 12K unique tokens. In other words, our vectors are 12K
>>>>> columns wide (although they are represented using SparseVectors). We have
>>>>> a step that attempts to locate all documents sharing the same tokens, and
>>>>> for those items we calculate the cosine similarity. However, the part
>>>>> that identifies documents with shared tokens is the bottleneck.
>>>>>
>>>>> For this portion, we map our data down to the individual tokens
>>>>> contained by each document. For example:
>>>>>
>>>>> DocumentId | Description
>>>>> -----------|---------------------
>>>>> 1          | Easton Hockey Stick
>>>>> 2          | Bauer Hockey Gloves
>>>>>
>>>>> In this case, we'd map to the following:
>>>>>
>>>>> (1, 'Easton')
>>>>> (1, 'Hockey')
>>>>> (1, 'Stick')
>>>>> (2, 'Bauer')
>>>>> (2, 'Hockey')
>>>>> (2, 'Gloves')
>>>>>
>>>>> Our goal is to aggregate this data as follows; however, our code that
>>>>> currently does this does not perform well. In the realistic 12K product
>>>>> scenario, this resulted in 430K document/token tuples.
>>>>>
>>>>> ((1, 2), ['Hockey'])
>>>>>
>>>>> This then tells us that documents 1 and 2 need to be compared to one
>>>>> another (via cosine similarity) because they both contain the token
>>>>> 'hockey'. I will investigate the methods that you recommended to see if
>>>>> they may resolve our problem.
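>>>>>
>>>>> For illustration, a minimal self-contained sketch of that pairing step
>>>>> (the object name, example data wiring, and local-mode setup below are
>>>>> placeholders, not the actual job):
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>>
>>>>> object SharedTokenPairs {
>>>>>   def main(args: Array[String]): Unit = {
>>>>>     val spark = SparkSession.builder
>>>>>       .appName("shared-token-pairs")
>>>>>       .master("local[*]")
>>>>>       .getOrCreate()
>>>>>     val sc = spark.sparkContext
>>>>>
>>>>>     // The two example documents from above, already tokenized.
>>>>>     val docTokens = sc.parallelize(Seq(
>>>>>       (1, Seq("Easton", "Hockey", "Stick")),
>>>>>       (2, Seq("Bauer", "Hockey", "Gloves"))
>>>>>     ))
>>>>>
>>>>>     // Map each document down to (token, documentId) pairs.
>>>>>     val tokenToDoc = docTokens.flatMap { case (id, tokens) =>
>>>>>       tokens.distinct.map(token => (token, id))
>>>>>     }
>>>>>
>>>>>     // Self-join on the token, keep each unordered pair once, and group
>>>>>     // the shared tokens per pair, e.g. ((1,2), List(Hockey)). This join
>>>>>     // is the step that explodes when a common token appears in thousands
>>>>>     // of documents, which is why LSH is suggested elsewhere in the thread.
>>>>>     val sharedTokenPairs = tokenToDoc.join(tokenToDoc)
>>>>>       .filter { case (_, (a, b)) => a < b }
>>>>>       .map { case (token, pair) => (pair, token) }
>>>>>       .groupByKey()
>>>>>
>>>>>     sharedTokenPairs.collect().foreach(println)
>>>>>     spark.stop()
>>>>>   }
>>>>> }
>>>>>
>>>>> Each resulting pair would then be scored with cosine similarity on the
>>>>> corresponding TF-IDF vectors.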
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
>>>>> nick.pentre...@gmail.com> wrote:
>>>>>
>>>>>> How many products do you have? How large are your vectors?
>>>>>>
>>>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>>>> products, then trying to compute all-pairs similarity with brute force
>>>>>> is not going to be scalable. In this case you may want to investigate
>>>>>> locality-sensitive hashing (LSH) techniques.
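>>>>>>
>>>>>> As a rough sketch of the LSH idea, here is what it might look like with
>>>>>> MinHashLSH, which ships with Spark ML from version 2.1 onwards (so this
>>>>>> is not one of the external libraries linked in this thread, and the
>>>>>> DataFrame and column names are assumptions):
>>>>>>
>>>>>> import org.apache.spark.ml.feature.MinHashLSH
>>>>>>
>>>>>> // "vectorized" is assumed to hold an "id" column and a "features"
>>>>>> // column of token-presence vectors (e.g. the HashingTF output).
>>>>>> val mh = new MinHashLSH()
>>>>>>   .setNumHashTables(5)
>>>>>>   .setInputCol("features")
>>>>>>   .setOutputCol("hashes")
>>>>>>
>>>>>> val model = mh.fit(vectorized)
>>>>>>
>>>>>> // Approximate self-join: only pairs whose estimated Jaccard distance
>>>>>> // falls below the threshold (0.8 here is a placeholder) are returned,
>>>>>> // which avoids the all-pairs comparison.
>>>>>> val candidates = model
>>>>>>   .approxSimilarityJoin(vectorized, vectorized, 0.8, "jaccardDist")
>>>>>>   .filter("datasetA.id < datasetB.id")   // drop self and mirrored pairs
>>>>>>
>>>>>> The surviving candidate pairs can then be scored exactly (e.g. with
>>>>>> cosine similarity) if needed.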
>>>>>>
>>>>>>
>>>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <
>>>>>> kevin.r.mell...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm trying to write a Spark application that will detect similar
>>>>>>> items (in this case products) based on their descriptions. I've got an
>>>>>>> ML pipeline that transforms the product data to a TF-IDF representation
>>>>>>> using the following components (a rough code sketch follows the list).
>>>>>>>
>>>>>>>    - *RegexTokenizer* - strips out non-word characters, results in
>>>>>>>    a list of tokens
>>>>>>>    - *StopWordsRemover* - removes common "stop words", such as
>>>>>>>    "the", "and", etc.
>>>>>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>>>>>    calculates the term frequency
>>>>>>>    - *IDF* - computes the inverse document frequency
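>>>>>>>
>>>>>>> A rough sketch of such a pipeline (the column names, regex pattern,
>>>>>>> and feature count below are placeholders, not the actual configuration):
>>>>>>>
>>>>>>> import org.apache.spark.ml.Pipeline
>>>>>>> import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF, IDF}
>>>>>>>
>>>>>>> val tokenizer = new RegexTokenizer()
>>>>>>>   .setInputCol("description")
>>>>>>>   .setOutputCol("tokens")
>>>>>>>   .setPattern("\\W+")            // split on non-word characters
>>>>>>>
>>>>>>> val remover = new StopWordsRemover()
>>>>>>>   .setInputCol("tokens")
>>>>>>>   .setOutputCol("filtered")
>>>>>>>
>>>>>>> val hashingTF = new HashingTF()
>>>>>>>   .setInputCol("filtered")
>>>>>>>   .setOutputCol("tf")
>>>>>>>   .setNumFeatures(1 << 18)
>>>>>>>
>>>>>>> val idf = new IDF()
>>>>>>>   .setInputCol("tf")
>>>>>>>   .setOutputCol("features")
>>>>>>>
>>>>>>> val pipeline = new Pipeline()
>>>>>>>   .setStages(Array(tokenizer, remover, hashingTF, idf))
>>>>>>>
>>>>>>> // "products" is assumed to be a DataFrame with a "description" column.
>>>>>>> val model = pipeline.fit(products)
>>>>>>> val vectorized = model.transform(products)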
>>>>>>>
>>>>>>> After this pipeline evaluates, I'm left with a SparseVector of TF-IDF
>>>>>>> weights for the tokens in each product. As a next step, I'd like to
>>>>>>> compare each vector to every other one, to detect similarities.
>>>>>>>
>>>>>>> Does anybody know of a straightforward way to do this in Spark? I
>>>>>>> tried creating a UDF (that used the Breeze linear algebra methods
>>>>>>> internally); however, that did not scale well.
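>>>>>>>
>>>>>>> For reference, the pairwise comparison in question is plain cosine
>>>>>>> similarity; over two ml SparseVectors it might look roughly like the
>>>>>>> following sketch (not the Breeze-based UDF mentioned above):
>>>>>>>
>>>>>>> import org.apache.spark.ml.linalg.SparseVector
>>>>>>>
>>>>>>> def cosine(a: SparseVector, b: SparseVector): Double = {
>>>>>>>   // Dot product over the indices present in both vectors.
>>>>>>>   val bMap = b.indices.zip(b.values).toMap
>>>>>>>   val dot = a.indices.zip(a.values)
>>>>>>>     .map { case (i, v) => v * bMap.getOrElse(i, 0.0) }.sum
>>>>>>>   val normA = math.sqrt(a.values.map(x => x * x).sum)
>>>>>>>   val normB = math.sqrt(b.values.map(x => x * x).sum)
>>>>>>>   if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
>>>>>>> }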
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kevin
>>>>>>>
>>>>>>
>>>>>
>>>
>>
