Re: Any Entity Resolution & Deduplication solution?

Suneel Marthi Tue, 03 Dec 2013 04:33:42 -0800

You don't need clustering for this.

Lucene should be able to help you create a dictionary here. Look at:

a) Lucene's CJK and Standard Analyzers
b) Mahout's DictionaryVectorizer (with appropriate Lucene Analyzer)

That, along with an appropriate choice of n-grams and stopwords, should do it for you.
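
To make the stopword and n-gram idea concrete, here is a minimal sketch in plain Python. It is not the Lucene or Mahout API; it only imitates what an analyzer with a stopword filter would do. The stopword lists, the function names (normalize, char_ngrams, jaccard, build_dictionary), and the choice of bigrams are all assumptions made for illustration. Note that it will not merge "Huawei" with "华为"; cross-language aliases would most likely need a separate alias table.

```python
import re
from collections import defaultdict

# Hypothetical stopword lists (assumptions, tune for real data).
TOKEN_STOPWORDS = {"technologies", "co", "ltd", "inc", "com"}
CJK_STOPWORDS = ["有限公司"]  # "Co. Ltd" in Chinese


def normalize(name):
    """Lowercase the name, drop stopwords, and join what is left."""
    text = name.lower()
    # Chinese has no spaces, so remove suffix stopwords as substrings.
    for suffix in CJK_STOPWORDS:
        text = text.replace(suffix, " ")
    # Split on whitespace, dots, and commas, then filter stopword tokens.
    tokens = re.split(r"[\s.,]+", text)
    return "".join(t for t in tokens if t and t not in TOKEN_STOPWORDS)


def char_ngrams(text, n=2):
    """Return the set of character n-grams of text (bigrams by default)."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two n-gram sets, useful for near-duplicates."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_dictionary(names):
    """Group raw names under their normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    raw = ["Huawei Technologies Co. Ltd", "Huawei", "huawei.com",
           "华为", "华为有限公司"]
    for key, members in build_dictionary(raw).items():
        print(key, "<-", members)
```

Exact matches on the normalized key catch the easy cases; comparing n-gram sets with jaccard is one way to catch misspellings that normalization alone misses.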

On Tuesday, December 3, 2013 3:41 AM, Jason Lee <[email protected]> wrote:
 
I have 10M+ textual company names (in Chinese) that were extracted from the
work experience sections of users' profiles. Because those company names are
entered manually by users of our site, there are lots of duplicates. Our goal
is to extract and cleanse that data to establish a company dictionary. For
example, these terms should be considered one company:

Huawei Technologies Co. Ltd
Huawei
huawei.com
华为                        ->  (华为 is Huawei in Chinese)
华为有限公司                         -> (有限公司 is Co. Ltd in Chinese)

It looks like a clustering process, but I don't have any idea how to
implement it.

Regards.
- Jason
