For example, suppose a large new department has been formed by merging three departments. A few employees worked for two or three of those departments before the merger, so one person's attributes may be listed in several departments' databases. An additional problem is that the same person can appear under different first names or nicknames.
The attributes of a person include first name, last name, email, home phone, cell phone, SSN, address, etc. Because any of these values can be empty, there is no unique primary key. Hence, we need an intelligent solution for classifying record pairs as matches or non-matches, with weights assigned to the different matching rules. Any tips for handling such fast, runtime deduplication tasks on big data (about 100 million records)? Is there any open-source project working on this?
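
To make the idea of weighted matching rules concrete, here is a rough sketch of what I have in mind. The field names, the weights in `WEIGHTS`, and the `THRESHOLD` value are all made up for illustration, and the helpers (`match_score`, `blocking_keys`, `find_duplicates`) are hypothetical, not taken from any library. Records are only compared when they share at least one non-empty strong identifier (a simple form of blocking), and matching pairs are merged into clusters with union-find.

```python
from collections import defaultdict
from itertools import combinations

# Assumed weights per field; real values would need tuning on labeled data.
WEIGHTS = {"ssn": 5.0, "email": 4.0, "cell_phone": 3.0,
           "home_phone": 2.0, "last_name": 1.0, "address": 1.0}
THRESHOLD = 4.0  # assumed cutoff: a pair scoring at least this is a match


def norm(value):
    """Normalize a field value; None and empty strings both become ''."""
    return (value or "").strip().lower()


def match_score(a, b):
    """Sum the weights of fields that are non-empty and equal in both records."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        va, vb = norm(a.get(field)), norm(b.get(field))
        if va and vb and va == vb:
            score += weight
    return score


def blocking_keys(rec):
    """Yield (field, value) keys; only records sharing a key get compared."""
    for field in ("ssn", "email", "cell_phone", "home_phone"):
        value = norm(rec.get(field))
        if value:
            yield (field, value)


def find_duplicates(records):
    """Return sorted clusters of record indices judged to be the same person."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        for key in blocking_keys(rec):
            blocks[key].append(i)

    parent = list(range(len(records)))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    seen = set()
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if (i, j) in seen:
                continue
            seen.add((i, j))
            if match_score(records[i], records[j]) >= THRESHOLD:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return sorted(c for c in clusters.values() if len(c) > 1)


records = [
    {"first_name": "Bob", "last_name": "Smith", "email": "bob@x.com"},
    {"first_name": "Robert", "last_name": "Smith", "email": "BOB@x.com",
     "cell_phone": "555-1234"},
    {"first_name": "Rob", "last_name": "Smith", "cell_phone": "555-1234",
     "ssn": "123-45-6789"},
    {"first_name": "Alice", "last_name": "Jones", "email": "alice@x.com"},
]
print(find_duplicates(records))
```

Note that first names are deliberately left out of the score because of the nickname problem, and that this in-memory version is only meant to show the logic. It would not scale to 100 million records as written, which is exactly the part I need help with.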