I have 10M+ company names (in Chinese) extracted from the work-experience sections of user profiles. Because these names were typed in manually by users of our site, there are many duplicates and variants of the same company. Our goal is to extract and cleanse this data to build a company dictionary. For example, the following terms should all be considered one company:
Huawei Technologies Co. Ltd
Huawei
huawei.com
华为 -> (华为 is "Huawei" in Chinese)
华为有限公司 -> (有限公司 is "Co. Ltd" in Chinese)

This looks like a clustering problem, but I have no idea how to implement it.

Regards,
Jason
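To make the desired grouping concrete, here is a minimal sketch in Python of one possible approach: normalize each name to a key, then bucket names that share a key. It is only an illustration under stated assumptions, not a complete solution. The suffix list `LEGAL_SUFFIXES` and the alias table `ALIASES` are hypothetical placeholders; in particular, linking the English and Chinese forms (Huawei and 华为) cannot come from string similarity alone, so the sketch assumes such a mapping is supplied from somewhere.

```python
from collections import defaultdict

# Hypothetical suffixes to strip; a real list would be far longer.
LEGAL_SUFFIXES = ["technologies co. ltd", "co. ltd", "有限公司", ".com"]

# Hypothetical alias table linking English and Chinese forms of a name.
ALIASES = {"huawei": "华为"}


def normalize(name: str) -> str:
    """Reduce a raw company name to a canonical grouping key."""
    key = name.strip().lower()
    changed = True
    # Keep stripping known suffixes until none of them matches.
    while changed:
        changed = False
        for suffix in LEGAL_SUFFIXES:
            if key.endswith(suffix) and len(key) > len(suffix):
                key = key[: -len(suffix)].strip()
                changed = True
    # Map known aliases to one canonical form; unknown keys pass through.
    return ALIASES.get(key, key)


def group_names(names):
    """Bucket raw names by their normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    raw = [
        "Huawei Technologies Co. Ltd",
        "Huawei",
        "huawei.com",
        "华为",
        "华为有限公司",
    ]
    for key, members in group_names(raw).items():
        print(key, members)
```

With the five example names above, all of them fall into a single bucket under the key 华为. Exact-key grouping like this does not handle typos or abbreviations; those would need some form of fuzzy matching on top of the normalized keys.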
