>>>>> In the realistic 12K product
>>>>> scenario, this resulted in 430K document/token tuples.
>>>>>
>>>>> ((1, 2), ['Hockey'])
>>>>>
>>>>> This then tells us that documents 1 and 2 need to be compared to one
>>>>> another.
>>>>>
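
A minimal sketch of the pairing step described above (the RDD name, the types,
and the code itself are illustrative assumptions, not Kevin's actual
implementation):

// Assumes docs: RDD[(Int, Seq[String])] of (documentId, tokens).
val candidatePairs = docs
  .flatMap { case (id, tokens) => tokens.distinct.map(t => (t, id)) }
  .groupByKey()                  // token -> ids of every document containing it
  .flatMap { case (token, idsIter) =>
    val ids = idsIter.toSeq.sorted
    for (i <- ids.indices; j <- (i + 1) until ids.size)
      yield ((ids(i), ids(j)), token)
  }
  .groupByKey()                  // e.g. ((1, 2), Seq("Hockey"))

The first flatMap is what produces the document/token tuples (430K of them in
the 12K-product scenario); each key in the final result is a pair of documents
that share at least one token and therefore needs a real comparison.
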
>>>> I will look into the LSH techniques you suggested, as they may resolve
>>>> our problem.
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> How many products do you have? How large are your vectors?
>>>>>
>>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>>> products then trying to compute all-pair similarity with brute force is
>>>>> not going to be scalable. In this case you may want to investigate
>>>>> hashing (LSH) techniques.
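
One way to follow Nick's suggestion in later Spark releases (spark.ml gained
LSH in 2.1, after this thread) is MinHashLSH. The sketch below is illustrative
only; the DataFrame and column names are assumptions. Note that MinHash treats
the non-zero entries of each vector as a set, so it measures token-set overlap
rather than weighted TF-IDF similarity.

import org.apache.spark.ml.feature.MinHashLSH

// products is assumed: a DataFrame with an "id" column and a sparse
// vector column "features".
val mh = new MinHashLSH()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setNumHashTables(5)

val model = mh.fit(products)

// Approximate self-join: only hash-collision candidates are compared,
// instead of all ~12K x 12K brute-force pairs. The threshold (0.6 here)
// is a Jaccard distance.
val similarPairs = model
  .approxSimilarityJoin(products, products, 0.6, "jaccardDist")
  .filter("datasetA.id < datasetB.id")  // drop self-pairs and mirrored pairs
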
>>>>>
>>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm trying to write a Spark application that will detect similar items
>>>>>> (in this case products) based on their descriptions. I've got an ML
>>>>>> pipeline that transforms the product data to TF-IDF representation,
>>>>>> using the following components.
>>>>>>
>>>>>> - *RegexTokenizer* - strips out non-word characters
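
A sketch of how a pipeline along these lines is typically wired up: only
RegexTokenizer appears in the text above, so the HashingTF and IDF stages and
all column names are assumptions inferred from "TF-IDF representation".

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, RegexTokenizer}

val tokenizer = new RegexTokenizer()
  .setInputCol("description")  // column name is an assumption
  .setOutputCol("tokens")
  .setPattern("\\W")           // split on, i.e. strip out, non-word characters

val tf = new HashingTF()
  .setInputCol("tokens")
  .setOutputCol("rawFeatures")

val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, tf, idf))
// val model = pipeline.fit(products)  // products: DataFrame with "description"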