Regarding whether this is classification or clustering: it is clustering, but you have initial conditions (the clusters you have already built by hand) that should be used to prime the algorithm.
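To make "priming" concrete, here is a rough sketch rather than Mahout code. It assumes each document has already been turned into a sparse feature vector (a dict of feature name to weight), which is not something the original question specifies. The function names, the cosine measure, and the 0.5 threshold are all illustrative assumptions. The existing clusters supply the starting centroids, and a new document either joins the closest one or starts a cluster of its own.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for k, w in vec.items():
            total[k] += w
    return {k: w / len(vectors) for k, w in total.items()}


def seed_centroids(clusters):
    """Prime the model: one centroid per existing (hand-built) cluster."""
    return {cid: centroid(docs) for cid, docs in clusters.items()}


def assign(doc, centroids, threshold=0.5, top_k=3):
    """Return (best_cluster_or_None, [(cluster_id, score), ...]).

    None means no existing cluster is similar enough, so the caller
    should start a new cluster of one. The threshold is a made-up
    default and would need tuning on real data.
    """
    scored = sorted(
        ((cosine(doc, c), cid) for cid, c in centroids.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    candidates = [(cid, score) for score, cid in scored[:top_k]]
    if candidates and candidates[0][1] >= threshold:
        return candidates[0][0], candidates
    return None, candidates


if __name__ == "__main__":
    existing = {
        "a": [{"red": 1.0, "shoe": 1.0}, {"red": 1.0, "shoes": 1.0}],
        "b": [{"blue": 1.0, "hat": 1.0}],
    }
    centroids = seed_centroids(existing)
    print(assign({"red": 1.0, "shoe": 1.0}, centroids))
    print(assign({"green": 1.0}, centroids))
```

The candidate list doubles as the "small set of clusters with a score" asked for in the question. The linear scan over every centroid is only for illustration; with many thousands of clusters you would want some index in front of it.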
Manuel's links are excellent.

The LSH-based clustering in the new clustering code could be competitive with these other methods in the event that your similarities are not close enough to make a classic dedupe algorithm work well. That could happen if you have documents that are concatenations of other documents, for instance.

See https://github.com/tdunning/knn for code. See particularly the class:

org.apache.mahout.knn.lsh.LocalitySensitiveHash

On Fri, May 11, 2012 at 6:28 AM, Manuel Blechschmidt <[email protected]> wrote:

> Hi Mike,
> actually for me it sounds like a deduplication problem.
>
> I wouldn't recommend Mahout for this task. There are better and faster
> approaches like:
> - SimHash based algorithms (http://matpalm.com/resemblance/simhash/)
> - Broder Shingles
>   http://www.cs.brown.edu/courses/csci2531/papers/nearduplicate.pdf
>
>   http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/de//pubs/archive/33026.pdf
>
> An overview can be found here:
> - http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf
>
> Hope that helps
> Manuel
>
> On 10.05.2012, at 23:57, mBria wrote:
>
> > Hi everyone,
> >
> > This may be a bit long, and I apologize up front. I'm new to Mahout (and
> > machine learning in general), and haven't actually built anything beyond
> > the MiA book's examples with it.
> >
> > I'm looking for a little nudge/guidance on where to direct my next level
> > of research/experimentation for a real-world problem.
> >
> > Basically, I need "document matching" support. Context laundry-list:
> > - "doc" is a somewhat sparse document with a set of 10-15 fields of
> >   varying length text (usually phrases) & numerical fields.
> > - it's sparse in that not all fields will be valued for all docs
> > - docs are almost always "logical duplicates" of a few other docs (say,
> >   2-5 on average); we'll call a set of "dup docs" a "cluster"
> > - there are millions of docs (and thus many thousands of "clusters")
> > - although they are logical duplicates, the field values may be similar,
> >   but are often not identical (degree of "similarity" will vary
> >   non-trivially)
> > - I've got an "example" document set (millions) already clustered
> >   (manually) in production
> >
> > So, what I want to build is a system that can take NEW documents, and
> > give automated insight into which of the existing clusters this document
> > belongs to, or an indication that it belongs to none.
> >
> > Initially, I saw this mostly as a "*CLASSIFICATION* problem":
> > - I've got an immense /training set/ already
> > - I want to "classify" new stuff based on smart /field-level similarity/
> >   evaluation
> > - I want to pick one "class" (i.e., cluster) the doc belongs to
> >
> > The problem with this (maybe?) is that I'm gathering that classification
> > really works best for BINARY classes ("you go here, or you go there"). My
> > case is that there are thousands of classes (clusters), and it may even
> > be that the given doc doesn't really fit any of them well (in which case
> > it should become a new cluster of one). To a lesser degree, I'd like to
> > know that I could, if I wanted, get the system to tell me a small set of
> > clusters the new doc may fit well with, with a "score".
> >
> > Looking at this then from a "*CLUSTERING* problem" angle:
> > - yes, I want docs "clustered" based on similarity of their field values
> > - but, I've already got the existing millions of docs clustered, and I
> >   just want to funnel new docs into the clusters
> >
> > So, while "clustering docs" is definitely the end result of the system, I
> > don't really think this is an obvious "clustering problem" from the
> > ML/Mahout POV. At least not a standard one.
> >
> > Looking at this from a "*RECOMMENDATION* problem" angle:
> > - I can kinda think of the existing clusters as containing docs
> >   "related" to the other docs in the cluster
> > - Then I could say this new doc is like another existing doc, which
> >   "associates" to these other docs (in the cluster), therefore this new
> >   one associates to those other ones (and belongs in the cluster)
> >
> > But, beyond this being a real stretch and probably silly (useless), the
> > big missing aspect is the ability to leverage doc field similarity. It's
> > advanced field value similarity which really drives the "match". So, I
> > don't think Recommenders help much here.
> >
> > My gut is telling me I want some hybrid of clustering and classification,
> > but I'm not sure.
> >
> > So, my head is still running full-speed trying to see this in various
> > ways to see what I can use from Mahout to contribute to my system, but
> > before I got too far down my own rabbit holes I wanted to Ask The Expert.
> >
> > Again, sorry for the novel!
> >
> > Any ideas, references to things to look at, anything at all that you
> > think might be helpful would be great. Not looking for anyone to "hand me
> > the solution", but polling for guidance.
> >
> > Thanks much!
> > Mike
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Some-guidance-for-this-noob-Metadata-Matching-Engine-tp3978388.html
> > Sent from the Mahout User List mailing list archive at Nabble.com.
>
> --
> Manuel Blechschmidt
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
