Hi Mike,
Actually, to me this sounds like a deduplication problem.

I wouldn't recommend Mahout for this task. There are better and faster 
approaches, such as:
- SimHash based algorithms (http://matpalm.com/resemblance/simhash/)
- Broder shingles (http://www.cs.brown.edu/courses/csci2531/papers/nearduplicate.pdf)

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/de//pubs/archive/33026.pdf
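
To give a feel for the SimHash idea, here is a minimal sketch in Python. It is only an illustration, not the exact algorithm from the links above: the whitespace tokenizer, the use of MD5 as the per-token hash, the 64-bit fingerprint size and the equal weight for every token are all my assumptions.

```python
import hashlib

BITS = 64  # fingerprint width (assumption for this sketch)


def token_hash(token):
    # First 8 bytes of MD5 as a 64-bit integer; any well-mixed hash would do.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text):
    # Every token votes +1 or -1 on each bit position of the fingerprint.
    votes = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set where the positive votes win.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    # Number of differing bits; a small distance suggests similar texts.
    return bin(a ^ b).count("1")
```

Two documents would then be treated as candidate duplicates when hamming(simhash(a), simhash(b)) stays below some small threshold that you would have to tune on your own data.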

An overview can be found here:
- http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf
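
For the shingle-based approach, a rough sketch of matching a new doc against existing clusters could look like the following. Everything here is hypothetical: the function names, word-level shingles of size k, treating each doc as one flat string, and the 0.5 threshold are my assumptions, and the brute-force loop would need to be replaced by something like MinHash with an index to cope with millions of docs.

```python
def shingles(text, k=2):
    # Set of k-word shingles; a text shorter than k words yields one shingle.
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    # Jaccard similarity of two sets, defined as 0.0 when both are empty.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_cluster(new_doc, clusters, threshold=0.5, k=2):
    # clusters: dict mapping a cluster id to the list of doc texts in it.
    # Returns (cluster_id, score), or (None, score) if nothing is close enough.
    new_sh = shingles(new_doc, k)
    best_id, best_score = None, 0.0
    for cluster_id, docs in clusters.items():
        score = max((jaccard(new_sh, shingles(d, k)) for d in docs), default=0.0)
        if score > best_score:
            best_id, best_score = cluster_id, score
    if best_score < threshold:
        return None, best_score  # candidate for a new cluster of one
    return best_id, best_score
```

With clusters = {"c1": ["acme widget blue large"], "c2": ["zeta gadget red small"]}, the call best_cluster("acme widget blue small", clusters) picks "c1", while a doc sharing no shingles with any cluster comes back with None.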

Hope that helps
    Manuel

On 10.05.2012, at 23:57, mBria wrote:

> Hi everyone,
> 
> This may be a bit long, and I apologize up front.  I'm new to Mahout (and
> machine learning in general), and haven't actually built anything beyond the
> MiA book's examples with it.
> 
> I'm looking for a little nudge/guidance on where to direct my next level of
> research/experimentation for a real-world problem.
> 
> Basically, I need "document matching" support.  Context laundry-list:
> - "doc" is a somewhat sparse document with a set of 10-15 fields of varying
> length text (usually phrases) & numerical fields.
> - it's sparse in that not all fields will be valued for all docs
> - docs are almost always "logical duplicates" of a few other docs (say, 2-5
> on average);  we'll call a set of "dup docs" a "cluster"
> - there are millions of docs (and thus many thousands of "clusters")
> - although they are logical duplicates, the field values may be similar, but
> are often not identical (degree of "similarity" will vary non-trivially)
> - I've got an "example" document set (millions) already clustered (manually) 
> in production
> 
> So, what I want to build is a system that can take NEW documents, and give
> automated insight into which of the existing clusters this document belongs
> to, or an indication that it belongs to none.
> 
> Initially, I saw this mostly as a "*CLASSIFICATION* problem":
> - I've got an immense /training set/ already
> - I want to "classify" new stuff based on smart /field-level similarity/
> evaluation
> - I want to pick one "class" (ie, cluster) the doc belongs to
> 
> The problem with this (maybe?) is that I'm gathering that classification
> really works best for BINARY classes ("you go here, or you go there").  My
> case is that there are thousands of classes (clusters), and it may even be
> that the given doc doesn't really fit any of them well (in which case it should
> become a new cluster of one).  To a lesser degree, I'd like to know that, if I
> wanted, I could get the system to tell me a small set of clusters the new doc
> may fit well with, each with a "score".
> 
> Looking at this then from a "*CLUSTERING* problem" angle:
> - yes, I want docs "clustered" based on similarity of their field values
> - but, I've already got the existing millions of docs clustered, and
> I just want to funnel new docs into the clusters
> 
> So, while "clustering docs" is definitely the end result of the system, I
> don't really think this is an obvious "clustering problem" from the
> ML/Mahout POV.  At least not a standard one.
> 
> Looking at this from a "*RECOMMENDATION* problem" angle:
> - I can kinda think of the existing clusters as containing
> docs "related" to the other docs in the cluster
> - Then I could say this new doc is like another existing doc, which
> "associates" to these other docs (in the cluster) therefore this new one
> associates to those other ones (and belongs in the cluster)
> 
> But, beyond this being a real stretch and probably silly (useless), the big
> missing aspect is the ability to leverage doc field similarity.  It's
> advanced field value similarity which really drives the "match".  So, I
> don't think Recommenders help much here.
> 
> My gut is telling me I want some hybrid of clustering and classification,
> but I'm not sure.
> 
> So, my head is still running full-speed trying to see this in various ways to
> see what I can use from Mahout to contribute to my system, but before I got
> too far down my own rabbit holes I wanted to Ask The Expert.
> 
> Again, sorry for the novel!
> 
> Any ideas, references to things to look at, anything at all that you think
> might be helpful would be great.  Not looking for anyone to "hand me the
> solution", but polling for guidance.
> 
> Thanks much!
> Mike
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Some-guidance-for-this-noob-Metadata-Matching-Engine-tp3978388.html
> Sent from the Mahout User List mailing list archive at Nabble.com.

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
