Hi Jason,

Mahout does not have any direct duplicate detection capabilities.
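As a rough starting point outside Mahout, here is a minimal Python sketch of a normalize-and-group pass over names like the ones in your example. It is not taken from any library mentioned in this thread. The noise-word list, the Chinese suffix list, the domain suffix pattern, and the alias table mapping 华为 to huawei are all hypothetical placeholders that would have to be built and maintained for real data.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical, hand-maintained lists; real data would need far more entries.
LATIN_NOISE = {"co", "ltd", "inc", "corp", "company", "limited",
               "technologies", "technology"}
CJK_SUFFIXES = ("股份有限公司", "有限责任公司", "有限公司", "公司")
DOMAIN_SUFFIX = re.compile(r"\.(com|cn|net|org)$")
ALIASES = {"华为": "huawei"}  # assumed cross-language alias table


def normalize(name: str) -> str:
    """Reduce a raw company name to a crude grouping key."""
    # Fold full-width characters, trim whitespace, and lowercase.
    s = unicodedata.normalize("NFKC", name).strip().lower()
    # Drop a trailing domain suffix such as ".com".
    s = DOMAIN_SUFFIX.sub("", s)
    # Drop one trailing Chinese legal-form suffix, longest candidates first.
    for suffix in CJK_SUFFIXES:
        if s.endswith(suffix) and len(s) > len(suffix):
            s = s[: -len(suffix)]
            break
    # Split on whitespace and punctuation, then discard generic words.
    tokens = [t for t in re.split(r"[\s.,()]+", s)
              if t and t not in LATIN_NOISE]
    key = " ".join(tokens)
    # Map known aliases (e.g. Chinese name to Latin name) onto one key.
    return ALIASES.get(key, key)


def group_names(names):
    """Group raw names by their normalized key."""
    groups = defaultdict(list)
    for n in names:
        groups[normalize(n)].append(n)
    return dict(groups)


if __name__ == "__main__":
    sample = ["Huawei Technologies Co. Ltd", "Huawei", "huawei.com",
              "华为", "华为有限公司"]
    for key, members in group_names(sample).items():
        print(key, members)
```

Exact-key grouping like this only catches variants that the rules anticipate. Names that differ by typos or abbreviations would still need a similarity comparison (for example edit distance) within or across groups, which is where a dedicated duplicate detection tool comes in.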
My former university provides a duplicate detection library (DuDe):
http://www.hpi.uni-potsdam.de/naumann/projekte/dude_duplicate_detection.html

If you want to tag entities, you might want to look into GATE:
http://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt

Hope that helps
Manuel

On 03.12.2013, at 09:41, Jason Lee wrote:

> I have 10M+ textual company names (in Chinese) that were extracted from the
> work experience sections of our users' profiles. Because those company names
> are entered manually by users of our site, there is a lot of duplication. Our
> goal is to extract and cleanse that data to establish a company dictionary.
> For example, these terms should be considered one company:
>
> Huawei Technologies Co. Ltd
> Huawei
> huawei.com
> 华为 -> (华为 is Huawei in Chinese)
> 华为有限公司 -> (有限公司 is Co. Ltd in Chinese)
>
> It looks like a clustering process, but I don't have any idea how I can
> implement it.
>
> Regards.
> - Jason

--
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobile: 0173/6322621
Twitter: http://twitter.com/Manuel_B
