Hi Mike,

to me this actually sounds like a deduplication problem. I wouldn't recommend Mahout for this task. There are better and faster approaches, such as:

- SimHash-based algorithms (http://matpalm.com/resemblance/simhash/)
- Broder shingles
http://www.cs.brown.edu/courses/csci2531/papers/nearduplicate.pdf
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/de//pubs/archive/33026.pdf

An overview can be found here:
- http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf

Hope that helps
Manuel

On 10.05.2012, at 23:57, mBria wrote:

> Hi everyone,
>
> This may be a bit long, and I apologize up front. I'm new to Mahout (and
> machine learning in general), and haven't actually built anything beyond the
> MiA book's examples with it.
>
> I'm looking for a little nudge/guidance on where to direct my next level of
> research/experimentation for a real-world problem.
>
> Basically, I need "document matching" support. Context laundry-list:
> - "doc" is a somewhat sparse document with a set of 10-15 fields of varying
> length text (usually phrases) & numerical fields.
> - it's sparse in that not all fields will be valued for all docs
> - docs are almost always "logical duplicates" of a few other docs (say, 2-5
> on average); we'll call a set of "dup docs" a "cluster"
> - there are millions of docs (and thus many thousands of "clusters")
> - although they are logical duplicates, the field values may be similar, but
> are often not identical (degree of "similarity" will vary non-trivially)
> - I've got an "example" document set (millions) already clustered (manually)
> in production
>
> So, what I want to build is a system that can take NEW documents, and give
> automated insight into which of the existing clusters this document belongs
> to, or an indication that it belongs to none.
>
> Initially, I saw this mostly as a "*CLASSIFICATION* problem":
> - I've got an immense /training set/ already
> - I want to "classify" new stuff based on smart /field-level similarity/
> evaluation
> - I want to pick one "class" (i.e., cluster) the doc belongs to
>
> The problem with this (maybe?) is that I'm gathering that classification
> really works best for BINARY classes ("you go here, or you go there").
> My case is that there are thousands of classes (clusters), and it may even be
> that the given doc doesn't really fit any of them well (in which case it
> should become a new cluster of one). To a lesser degree, I'd like to know
> that, if I wanted, I could get the system to report a small set of clusters
> the new doc may fit well with, each with a "score".
>
> Looking at this then from a "*CLUSTERING* problem" angle:
> - yes, I want docs "clustered" based on similarity of their field values
> - but, I've already got the existing millions of docs clustered, and
> I just want to funnel new docs into the clusters
>
> So, while "clustering docs" is definitely the end result of the system, I
> don't really think this is an obvious "clustering problem" from the
> ML/Mahout POV. At least not a standard one.
>
> Looking at this from a "*RECOMMENDATION* problem" angle:
> - I can kinda think of the existing clusters as containing
> docs "related" to the other docs in the cluster
> - Then I could say this new doc is like another existing doc, which
> "associates" to these other docs (in the cluster), therefore this new one
> associates to those other ones (and belongs in the cluster)
>
> But, beyond this being a real stretch and probably silly (useless), the big
> missing aspect is the ability to leverage doc field similarity. It's
> advanced field value similarity which really drives the "match". So, I
> don't think Recommenders help much here.
>
> My gut is telling me I want some hybrid of clustering and classification,
> but I'm not sure.
>
> So, my head is still running full-speed trying to see this in various ways to
> see what I can use from Mahout to contribute to my system, but before I get
> too far down my own rabbit holes I wanted to Ask The Expert.
>
> Again, sorry for the novel!
>
> Any ideas, references to things to look at, anything at all that you think
> might be helpful would be great.
> Not looking for anyone to "hand me the solution", but polling for guidance.
>
> Thanks much!
> Mike
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Some-guidance-for-this-noob-Metadata-Matching-Engine-tp3978388.html
> Sent from the Mahout User List mailing list archive at Nabble.com.

--
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
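
For illustration, here is a minimal Python sketch of the two ideas named in the reply above: word shingles compared by Jaccard similarity, and a SimHash fingerprint compared by Hamming distance. It is not taken from the linked pages or papers. The function names, the 2-word shingle size, and the use of MD5 as the per-feature hash are all assumptions chosen to keep the example short.

```python
import hashlib


def shingles(text, k=2):
    """Return the set of k-word shingles (contiguous word tuples) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat it as a single shingle (assumption).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash64(feature):
    """Deterministic 64-bit hash of a feature (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(repr(feature).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features, bits=64):
    """Unweighted SimHash: each feature votes +1/-1 on every bit position."""
    counts = [0] * bits
    for feature in features:
        h = _hash64(feature)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    # Hypothetical documents: field values flattened into one string.
    doc_a = "acme widget blue large"
    doc_b = "acme widget blue small"
    print(jaccard(shingles(doc_a), shingles(doc_b)))
    print(hamming(simhash(shingles(doc_a)), simhash(shingles(doc_b))))
```

In a setup like the one described in the quoted message, a new doc's fields could be flattened into text, fingerprinted, and compared against fingerprints of docs in existing clusters. Useful thresholds for "same cluster" versus "new cluster of one" would have to be tuned on the already-clustered data.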
