Hi Suneel, Manuel,

Thank you so much for your advice, and sorry for my late reply.
I will try out the methods and libraries you mentioned and let you know how it goes, along with any questions that come up along the way.

On Wed, Dec 4, 2013 at 3:28 AM, Manuel Blechschmidt <[email protected]> wrote:

> Hi Jason,
> mahout does not have any direct duplication detection capabilities.
>
> My former university provides a duplication detection library (dude):
>
> http://www.hpi.uni-potsdam.de/naumann/projekte/dude_duplicate_detection.html
>
> If you want to tag entities you might want to look into GATE.
> http://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt
>
> Hope that helps
> Manuel
>
> On 03.12.2013, at 09:41, Jason Lee wrote:
>
> > I have 10M+ textual company names(in Chinese) that extracted from work
> > experiences of user's profile. Because those company names are manually
> > entered by users of our site, so there are lots of duplication. Our goal is
> > extracting & cleansing those data to establish a company dictionary. For
> > example, those terms should considered as one company:
> >
> > Huawei Technologies Co. Ltd
> > Huawei
> > huawei.com
> > 华为 -> (华为 is Huawei in Chinese)
> > 华为有限公司 -> (有限公司 is Co. Ltd in Chinese)
> >
> > Looks like it's a clustering process, but i don't have any idea how can i
> > implement it.
> >
> > Regards.
> > - Jason
>
> --
> Manuel Blechschmidt
> M.Sc. IT Systems Engineering
> Dortustr. 57
> 14467 Potsdam
> Mobil: 0173/6322621
> Twitter: http://twitter.com/Manuel_B
