You don't need clustering for this. Lucene should be able to help you build the dictionary. Look at:

a) Lucene's CJK and Standard Analyzers
b) Mahout's DictionaryVectorizer (with an appropriate Lucene Analyzer)

Those, along with an appropriate choice of n-grams and stopwords, should do it for you.

On Tuesday, December 3, 2013 3:41 AM, Jason Lee <[email protected]> wrote:

I have 10M+ textual company names (in Chinese) that were extracted from the work experience in users' profiles. Because those company names are entered manually by users of our site, there is a lot of duplication. Our goal is to extract and cleanse the data to establish a company dictionary.

For example, these terms should be considered one company:

Huawei Technologies Co. Ltd
Huawei
huawei.com
华为 -> (华为 is Huawei in Chinese)
华为有限公司 -> (有限公司 is Co. Ltd in Chinese)

It looks like a clustering process, but I don't have any idea how to implement it.

Regards.
- Jason
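
To make the analyzer-plus-stopwords suggestion from the reply concrete, here is a minimal sketch in plain Python. It does not use Lucene or Mahout; it only imitates what an analyzer with a stopword list does (lowercase, tokenize, drop uninformative tokens) and then groups names by the resulting key. The stopword lists, the alias table, and the names `canonical_key` and `build_dictionary` are all assumptions made up for illustration. Note that stopword removal alone cannot link 华为 to Huawei, so the sketch assumes a small hand-made alias table for that step. The n-gram part of the suggestion is not shown.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical stopword lists: tokens that do not identify the company.
LATIN_STOPWORDS = {"technologies", "co", "ltd", "inc", "com"}
CJK_STOPWORDS = ["有限公司"]

# Hypothetical alias table linking a Chinese name to its Latin form.
ALIASES = {"华为": "huawei"}

# A token is a run of Latin letters/digits or a run of CJK ideographs.
TOKEN_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]+")


def canonical_key(name):
    """Reduce a raw company name to a normalized grouping key."""
    text = unicodedata.normalize("NFKC", name).lower()
    # CJK text has no spaces, so strip CJK stopwords as substrings first.
    for word in CJK_STOPWORDS:
        text = text.replace(word, " ")
    tokens = [t for t in TOKEN_RE.findall(text) if t not in LATIN_STOPWORDS]
    # Map known aliases onto a single spelling.
    tokens = [ALIASES.get(t, t) for t in tokens]
    return " ".join(tokens)


def build_dictionary(names):
    """Group raw names under their canonical key."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return dict(groups)


names = [
    "Huawei Technologies Co. Ltd",
    "Huawei",
    "huawei.com",
    "华为",
    "华为有限公司",
]
dictionary = build_dictionary(names)
```

With these assumed lists, the five example names from the question end up under one key. On real data the stopword and alias lists would need to be much larger, and that is where a proper Lucene analyzer chain earns its keep.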
