Re: Similar Items

2016-09-21 Thread Nick Pentreath
. In the realistic 12K product >>>>> scenario, this resulted in 430K document/token tuples. >>>>> >>>>> ((1, 2), ['Hockey']) >>>>> >>>>> This then tells us that documents 1 and 2 need to be compared to one >>>>>
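A minimal sketch of the candidate-pair idea in the snippet above, assuming plain Python rather than Spark: build an inverted index from token to document ids, then emit every pair of documents that share at least one token, keyed the same way as the `((1, 2), ['Hockey'])` tuple. The function name `candidate_pairs` and the sample data are hypothetical and are not taken from the thread.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(doc_tokens):
    """Return {(doc_a, doc_b): [shared tokens]} for documents sharing a token.

    doc_tokens maps a document id to its list of tokens (illustrative input
    shape, assumed for this sketch).
    """
    # Inverted index: token -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, tokens in doc_tokens.items():
        for token in set(tokens):
            index[token].add(doc_id)

    # Every pair of documents under the same token is a comparison candidate.
    pairs = defaultdict(list)
    for token in sorted(index):
        for a, b in combinations(sorted(index[token]), 2):
            pairs[(a, b)].append(token)
    return dict(pairs)


# Hypothetical sample data: only documents 1 and 2 share a token.
docs = {
    1: ["Hockey", "Stick"],
    2: ["Hockey", "Puck"],
    3: ["Tennis", "Racket"],
}
print(candidate_pairs(docs))  # {(1, 2): ['Hockey']}
```

Only pairs that appear in the result need a full similarity computation, so documents with no tokens in common are never compared.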

Re: Similar Items

2016-09-21 Thread Nick Pentreath
>> they may resolve our problem. >>>> >>>> Thanks, >>>> Kevin >>>> >>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath < >>>> nick.pentre...@gmail.com> wrote: >>>> >>>>> How many products

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
ell...@gmail.com> wrote: > Hi all, > > I'm trying to write a Spark application that will detect similar items (in > this case products) based on their descriptions. I've got an ML pipeline > that transforms the product data to TF-IDF representation, using the

Re: Similar Items

2016-09-20 Thread Kevin Mellott
ntreath < >>> nick.pentre...@gmail.com> wrote: >>> >>>> How many products do you have? How large are your vectors? >>>> >>>> It could be that SVD / LSA could be helpful. But if you have many >>>> products then trying to compute all-pair sim

Re: Similar Items

2016-09-20 Thread Kevin Mellott
elpful. But if you have many >>> products then trying to compute all-pair similarity with brute force is not >>> going to be scalable. In this case you may want to investigate hashing >>> (LSH) techniques. >>> >>> >>> On Mon, 19 Sep 2016 at 22:4
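To illustrate the hashing (LSH) suggestion quoted above, here is a rough sketch using MinHash signatures with banding, written in plain Python. MinHash is one LSH family (it targets Jaccard similarity over token sets) and is chosen here only for illustration; the thread does not say which LSH variant was meant, and TF-IDF vectors compared by cosine similarity would more naturally use a random-projection scheme. All function names, parameters, and sample data are assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature of a non-empty token set (one minimum per seed)."""
    signature = []
    for seed in range(num_hashes):
        # Seeded, deterministic 64-bit hash of each token; keep the smallest.
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{token}".encode()).digest()[:8], "big")
            for token in tokens
        ))
    return signature


def lsh_candidates(doc_tokens, num_hashes=32, bands=16):
    """Return the set of document-id pairs that share at least one LSH bucket."""
    rows = num_hashes // bands  # signature rows per band
    buckets = defaultdict(set)
    for doc_id, tokens in doc_tokens.items():
        sig = minhash_signature(set(tokens), num_hashes)
        for band in range(bands):
            # Documents agreeing on every row of a band land in one bucket.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)

    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical sample data: documents 1 and 2 have identical token sets.
docs = {
    1: ["hockey", "stick", "ice"],
    2: ["ice", "hockey", "stick"],
    3: ["tennis", "racket", "court"],
}
print(lsh_candidates(docs))
```

The cost is roughly linear in the number of documents, plus the candidate pairs within each bucket, rather than quadratic in the number of documents. More bands with fewer rows each catch less similar pairs at the price of more false candidates, which are then filtered with an exact similarity check.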

Re: Similar Items

2016-09-20 Thread Nick Pentreath
not >> going to be scalable. In this case you may want to investigate hashing >> (LSH) techniques. >> >> >> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com> >> wrote: >> >>> Hi all, >>> >>> I'm trying to w

Re: Similar Items

2016-09-20 Thread Nick Pentreath
. On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <kevin.r.mell...@gmail.com> wrote: > Hi all, > > I'm trying to write a Spark application that will detect similar items (in > this case products) based on their descriptions. I've got an ML pipeline > that transforms the product data to

Similar Items

2016-09-19 Thread Kevin Mellott
Hi all, I'm trying to write a Spark application that will detect similar items (in this case products) based on their descriptions. I've got an ML pipeline that transforms the product data to TF-IDF representation, using the following components. - *RegexTokenizer* - strips out non-word