Re: Pairwise Document Similarity

Niall Riddell Fri, 29 Jul 2011 06:05:43 -0700

Thanks Grant, and apologies for not reading the JIRA properly.  My comment
was not intended as a criticism of the excellent Recommender work in Taste.
I see now that the intention is to improve performance for specific use
cases.

My intention is to build a scalable near-duplicate document detection
capability using Mahout.  The use case that I'm researching will require a
scalable approach across a large corpus of documents (>50m and growing).  I
see that there is another thread today on this topic.

I will go ahead and implement the solution as outlined in my original post
and delve further into the LSH implementation embedded in the clustering
code and share my findings with the community.  I'm looking to test this on
a large corpus using EMR in order to get a feeling for the timings.
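
For anyone following the thread, here is a minimal single-machine sketch of
the general idea behind LSH-based near-duplicate detection: word shingles,
MinHash signatures, then banding to produce candidate pairs.  It is not
Mahout's implementation and not necessarily the approach from my original
post.  Every name and parameter in it (shingles, minhash_signature,
candidate_pairs, k=3, 64 hashes, 16 bands) is an assumption chosen for
illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    # Derive one 64-bit hash per seed from MD5; any stable hash would do.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Keep the minimum hash value per seed; empty input gives ()."""
    if not shingle_set:
        return ()
    return tuple(
        min(_seeded_hash(seed, s) for s in shingle_set)
        for seed in range(num_hashes)
    )


def candidate_pairs(signatures, bands=16):
    """Return pairs of doc ids whose signatures agree on at least one band.

    signatures maps doc id -> signature tuple.
    """
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        if not sig:
            continue
        rows = len(sig) // bands
        for b in range(bands):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions (estimates Jaccard)."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Only the candidate pairs would then need an exact similarity check, which is
what avoids comparing every document against every other.  On a corpus of the
size I have in mind, the bucketing step would have to run as a MapReduce job
keyed on (band, band values) rather than in memory as above.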

Cheers, Niall

On Jul 23, 2011 7:55 AM, "Grant Ingersoll" <[email protected]> wrote:
>
>
> On Jul 22, 2011, at 7:23 AM, Niall Riddell wrote:
> >
> >
> > I've gone through MIA and felt that the RowSimilarityJob was a
> > possibility, however I understand that a JIRA has been raised to make
> > this potentially less general and in its current form it may not
> > match my performance/cost criteria (i.e. high/low).
>
> I don't think the goal of the JIRA issue is to make it less general, it's
> to make the cases that can benefit from smarter use of the co-occurrences
> scale better.  I see no reason why the existing format can't also be
> maintained for those similarity measures that can't benefit from more map
> side calculation.
>
> -Grant
>
