It's closest to a clustering problem. Because your clusters are so particular -- the elements are very close to each other, very distinct from others -- it reduces to something similar.
If you had a good similarity metric for docs, you would just compare a new doc against every existing doc and find the one it is nearly identical to. (You could speed it up by keeping just one representative doc for each cluster.)

The question is just one of constructing a similarity metric. Is it true that duplicates will match on most fields, and non-duplicates will match on virtually none? Then there's your metric, and there should be some bright-line threshold between close and not-close documents.

Sean

On Thu, May 10, 2012 at 10:57 PM, mBria <[email protected]> wrote:
> Hi everyone,
>
> This may be a bit long, and I apologize up front. I'm new to Mahout (and
> machine learning in general), and haven't actually built anything beyond the
> MiA book's examples with it.
>
> I'm looking for a little nudge/guidance on where to direct my next level of
> research/experimentation for a real-world problem.
>
> Basically, I need "document matching" support. Context laundry-list:
> - "doc" is a somewhat sparse document with a set of 10-15 fields of varying
> length text (usually phrases) & numerical fields.
> - it's sparse in that not all fields will be valued for all docs
> - docs are almost always "logical duplicates" of a few other docs (say, 2-5
> on average); we'll call a set of "dup docs" a "cluster"
> - there are millions of docs (and thus many thousands of "clusters")
> - although they are logical duplicates, the field values may be similar, but
> are often not identical (degree of "similarity" will vary non-trivially)
> - I've got an "example" document set (millions) already clustered (manually)
> in production
>
> So, what I want to build is a system that can take NEW documents, and give
> automated insight into which of the existing clusters this document belongs to,
> or an indication that it belongs to none.
>
> Initially, I saw this mostly as a "*CLASSIFICATION* problem":
> - I've got an immense /training set/ already
> - I want to "classify" new stuff based on smart /field-level similarity/
> evaluation
> - I want to pick one "class" (i.e., cluster) the doc belongs to
>
> The problem with this (maybe?) is that I'm gathering that classification
> really works best for BINARY classes ("you go here, or you go there"). My
> case is that there are thousands of classes (clusters), and it may even be
> that the given doc doesn't really fit any of them well (in which case it should
> become a new cluster of one). To a lesser degree, I'd like to know that, if I
> wanted, I could get the system to give me a small set of clusters the new doc
> may fit well with, each with a "score".
>
> Looking at this then from a "*CLUSTERING* problem" angle:
> - yes, I want docs "clustered" based on similarity of their field values
> - but, I've already got the existing millions of docs clustered, and
> I just want to funnel new docs into the clusters
>
> So, while "clustering docs" is definitely the end result of the system, I
> don't really think this is an obvious "clustering problem" from the
> ML/Mahout POV. Least not a standard one.
>
> Looking at this from a "*RECOMMENDATION* problem" angle:
> - I can kinda think of the existing clusters as containing
> docs "related" to the other docs in the cluster
> - Then I could say this new doc is like another existing doc, which
> "associates" to these other docs (in the cluster), therefore this new one
> associates to those other ones (and belongs in the cluster)
>
> But, beyond this being a real stretch and probably silly (useless), the big
> missing aspect is the ability to leverage doc field similarity. It's
> advanced field value similarity which really drives the "match". So, I
> don't think Recommenders help much here.
>
> My gut is telling me I want some hybrid of clustering and classification,
> but I'm not sure.
>
> So, my head is still running full-speed trying to see this in various ways to
> see what I can use from Mahout to contribute to my system, but before I got
> too far down my own rabbit holes I wanted to Ask The Expert.
>
> Again, sorry for the novel!
>
> Any ideas, references to things to look at, anything at all that you think
> might be helpful would be great. Not looking for anyone to "hand me the
> solution", but polling for guidance.
>
> Thanks much!
> Mike
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Some-guidance-for-this-noob-Metadata-Matching-Engine-tp3978388.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
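
For what it's worth, the matching approach suggested in the reply at the top of this thread (a per-field similarity, one representative doc per cluster, and a threshold) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the field names, the 0.8 threshold, the use of Python's difflib for text fields, and the helper names field_similarity, doc_similarity, and assign_cluster. None of it is Mahout API.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity in [0, 1] for a single field value (illustrative choice)."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        if a == b:
            return 1.0
        # Relative difference, clamped so the result stays in [0, 1].
        scale = max(abs(a), abs(b))
        return max(0.0, 1.0 - abs(a - b) / scale)
    # Text fields: case-insensitive character-level match ratio.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def doc_similarity(doc_a, doc_b):
    """Average field similarity over fields that are valued in both docs."""
    shared = [
        key for key in doc_a
        if doc_a[key] is not None and doc_b.get(key) is not None
    ]
    if not shared:
        return 0.0  # sparse docs with no common fields cannot be compared
    return sum(field_similarity(doc_a[k], doc_b[k]) for k in shared) / len(shared)


def assign_cluster(new_doc, representatives, threshold=0.8):
    """Return (cluster_id, score); cluster_id is None if nothing is close enough."""
    best_id, best_score = None, 0.0
    for cluster_id, rep in representatives.items():
        score = doc_similarity(new_doc, rep)
        if score > best_score:
            best_id, best_score = cluster_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score  # caller would start a new cluster of one


# Hypothetical data: one representative doc per existing cluster.
representatives = {
    "c1": {"title": "Acme Widget 3000", "maker": "Acme Corp", "price": 19.99},
    "c2": {"title": "Blue Garden Hose", "maker": "HoseCo", "price": 34.50},
}
near_duplicate = {"title": "ACME Widget 3000", "maker": "Acme Corp.", "price": 19.99}
unrelated = {"title": "Stainless Kettle", "maker": "KettleWorks", "price": 120.0}

print(assign_cluster(near_duplicate, representatives))
print(assign_cluster(unrelated, representatives))
```

The threshold would presumably be tuned against the manually clustered production set, where the "bright line" between duplicates and non-duplicates can be measured directly. Comparing against every representative is linear in the number of clusters, so at this scale some indexing or blocking step in front of it is likely needed.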
